---
title: "Selection: subset of columns"
toc: true
---

To select a column in a `DataFrame`, we can use bracket notation: the name of the DataFrame followed by the column name in square brackets, as in `df['column_name']`.

<center><img src="https://pandas.pydata.org/docs/_images/03_subset_columns.svg" width="85%" style="filter:invert(1)"></center>

For example, to select the column named `Candidate` from the `elections` DataFrame, we can use the following code:

In [None]:
candidates = elections['Candidate']
print(candidates)

0         Andrew Jackson
1      John Quincy Adams
2         Andrew Jackson
3      John Quincy Adams
4         Andrew Jackson
             ...        
177           Jill Stein
178         Joseph Biden
179         Donald Trump
180         Jo Jorgensen
181       Howard Hawkins
Name: Candidate, Length: 182, dtype: object


This extracts a single column as a `Series`. We can confirm this by checking the type of the output.

In [None]:
type(candidates)

pandas.core.series.Series
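As a minimal, self-contained sketch of the same idea, the snippet below builds a small made-up DataFrame (`toy`, with hypothetical candidates and vote counts, not the real `elections` data) and selects one column from it. It shows that the resulting `Series` keeps the column label as its name and shares the row index of the DataFrame.

```python
import pandas as pd

# Hypothetical stand-in data; names and numbers are invented for illustration.
toy = pd.DataFrame({
    "Candidate": ["A. Example", "B. Sample", "C. Placeholder"],
    "Party": ["Red", "Blue", "Green"],
    "Votes": [120, 95, 15],
})

votes = toy["Votes"]  # a single column label returns a Series

print(type(votes).__name__)            # Series
print(votes.name)                      # the column label becomes the Series name
print(votes.index.equals(toy.index))   # the row index is carried over unchanged
```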

To select multiple columns, we can pass a list of column names. For example, to select both the `Candidate` and `Party` columns from the `elections` DataFrame, we can use the following line of code:



In [None]:
elections[['Candidate', 'Party']]

Unnamed: 0,Candidate,Party
0,Andrew Jackson,Democratic-Republican
1,John Quincy Adams,Democratic-Republican
2,Andrew Jackson,Democratic
3,John Quincy Adams,National Republican
4,Andrew Jackson,Democratic
...,...,...
177,Jill Stein,Green
178,Joseph Biden,Democratic
179,Donald Trump,Republican
180,Jo Jorgensen,Libertarian


This extracts multiple columns as a `DataFrame`. We can confirm this as well by checking the type of the output.

In [None]:
type(elections[['Candidate', 'Party']])

pandas.core.frame.DataFrame
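One consequence worth seeing in isolation: it is the list, not the number of columns, that determines the return type. The sketch below uses a small hypothetical DataFrame (`toy`, invented for illustration) to compare a single label with a one-element list, and to show that columns come back in the order the list names them.

```python
import pandas as pd

# Hypothetical stand-in data; names and numbers are invented for illustration.
toy = pd.DataFrame({
    "Candidate": ["A. Example", "B. Sample", "C. Placeholder"],
    "Party": ["Red", "Blue", "Green"],
    "Votes": [120, 95, 15],
})

as_series = toy["Candidate"]            # single label -> Series
as_frame = toy[["Candidate"]]           # list with one label -> DataFrame
two_cols = toy[["Votes", "Candidate"]]  # columns follow the order of the list

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # three rows, one column
print(list(two_cols.columns))
```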

This is how we can select columns in a `DataFrame`. Next, let's learn how to filter rows.


### `[]`

The `[]` selection operator is the most baffling of all, yet the most commonly used. It only takes a single argument, which may be one of the following:

1. A slice of row numbers
2. A list of column labels
3. A single column label

That is, `[]` is _context dependent_. Let’s see some examples.

Say we wanted the first four rows of our `elections` DataFrame.

In [None]:
elections[0:4]

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
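The three argument types listed above can be compared side by side. The following is a sketch on a small hypothetical DataFrame (`toy`, invented for illustration) with the default integer index; it also shows that a bare integer is not treated as a row number, since `[]` looks it up as a column label.

```python
import pandas as pd

# Hypothetical stand-in data; names and numbers are invented for illustration.
toy = pd.DataFrame({
    "Candidate": ["A. Example", "B. Sample", "C. Placeholder", "D. Demo"],
    "Party": ["Red", "Blue", "Green", "Red"],
    "Votes": [120, 95, 15, 60],
})

rows = toy[0:2]                     # slice -> rows (end of the slice is excluded)
cols = toy[["Candidate", "Party"]]  # list of labels -> DataFrame of those columns
one = toy["Party"]                  # single label -> Series

print(rows.shape)
print(cols.shape)
print(type(one).__name__)

# A single integer is interpreted as a column label, not a row position,
# so this fails because no column is named 0.
try:
    toy[0]
    integer_lookup_failed = False
except KeyError:
    integer_lookup_failed = True
print(integer_lookup_failed)
```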


### `.drop_duplicates()`

If we have a DataFrame with many repeated rows, then [`.drop_duplicates()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) can be used to remove the repeated rows.

Where `.unique()` only works with individual columns (Series) and returns an array of unique values, `.drop_duplicates()` can be used with multiple columns (DataFrame) and returns a DataFrame with the repeated rows removed.

In [None]:
elections[['Candidate', 'Party']].drop_duplicates()

Unnamed: 0,Candidate,Party
0,Andrew Jackson,Democratic-Republican
1,John Quincy Adams,Democratic-Republican
2,Andrew Jackson,Democratic
3,John Quincy Adams,National Republican
5,Henry Clay,National Republican
...,...,...
174,Evan McMullin,Independent
176,Hillary Clinton,Democratic
178,Joseph Biden,Democratic
180,Jo Jorgensen,Libertarian
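To make the contrast with `.unique()` concrete, here is a minimal sketch on a hypothetical four-row DataFrame (`toy`, invented for illustration). It also uses the `subset` parameter of `.drop_duplicates()`, which the text above does not cover, to deduplicate on one column only.

```python
import pandas as pd

# Hypothetical stand-in data; rows 0 and 1 are exact duplicates.
toy = pd.DataFrame({
    "Candidate": ["A", "A", "B", "A"],
    "Party": ["Red", "Red", "Blue", "Blue"],
})

# .unique() works on one column (a Series) and returns an array of values.
unique_candidates = toy["Candidate"].unique()
print(list(unique_candidates))

# .drop_duplicates() compares whole rows and keeps the first occurrence;
# the surviving rows keep their original index labels.
deduped = toy.drop_duplicates()
print(list(deduped.index))

# With subset=..., only the listed columns are compared.
by_candidate = toy.drop_duplicates(subset=["Candidate"])
print(list(by_candidate.index))
```

Because the original index labels are preserved, gaps in the index (such as the jump from 3 to 5 in the output above) indicate where repeated rows were removed.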


### `.sample()`

As we will see later in the semester, random processes are at the heart of many data science techniques (for example, train-test splits, bootstrapping, and cross-validation). [`.sample()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) lets us quickly select random entries (a row if called from a DataFrame, or a value if called from a Series).

By default, `.sample()` selects entries *without* replacement. Pass in the argument `replace=True` to sample with replacement.

In [None]:
# Sample a single row
elections.sample()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
135,1988,George H. W. Bush,Republican,48886597,win,53.518845


In [None]:
# Sample 5 random rows
elections.sample(5)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
155,2000,Ralph Nader,Green,2882955,loss,2.741176
134,1984,Walter Mondale,Democratic,37577352,loss,40.729429
39,1884,Grover Cleveland,Democratic,4914482,win,48.884933
84,1928,Herbert Hoover,Republican,21427123,win,58.368524
177,2016,Jill Stein,Green,1457226,loss,1.073699


In [None]:
# Randomly sample 4 rows from the winning candidates, with replacement
elections[elections["Result"] == "win"].sample(4, replace=True)

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
53,1896,William McKinley,Republican,7112138,win,51.213817
131,1980,Ronald Reagan,Republican,43903230,win,50.897944
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
168,2012,Barack Obama,Democratic,65915795,win,51.258484
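The difference between sampling with and without replacement can be checked on a small hypothetical DataFrame (`toy`, invented for illustration). This sketch also passes `random_state`, a `.sample()` parameter not used in the examples above, to make the draw reproducible.

```python
import pandas as pd

# Hypothetical stand-in data; names and numbers are invented for illustration.
toy = pd.DataFrame({
    "Candidate": ["A. Example", "B. Sample", "C. Placeholder", "D. Demo"],
    "Votes": [120, 95, 15, 60],
})

# The same random_state gives the same rows on every call.
first = toy.sample(3, random_state=0)
second = toy.sample(3, random_state=0)
print(first.equals(second))

# Without replacement, asking for more rows than exist raises an error.
try:
    toy.sample(10)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print(oversample_failed)

# With replacement, rows may repeat, so any sample size is allowed.
with_repl = toy.sample(10, replace=True, random_state=0)
print(len(with_repl))
print(with_repl.index.is_unique)  # 10 draws from 4 rows must contain repeats
```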
