# Indexing and Selecting Data from DataFrames

## Outline

* Selecting Data from DataFrames
* Our data
* Selecting Columns with `[]`
* Selecting Data with `loc[]`
* A Moment for `iloc[]`

## Selecting Data from DataFrames

If you're familiar with the idea of a SQL table, or even an Excel spreadsheet, then you know that you will want to be able to select data in various ways.  Maybe you'd like to select certain rows, or perhaps certain columns. Maybe a combination of both.  `DataFrame` can do it all.

We will explore basic column selection using the DataFrame object's `[]` operator, and then we will have an in-depth look at the `DataFrame.loc[]` attribute which provides a powerful variety of ways to access portions of data in a `DataFrame`.

## Our data

Let's load our data from a CSV file. We end up with a `DataFrame` of employee records.

In [4]:
import pandas as pd
from pathlib import Path

df = pd.read_csv(Path('data/employee_attrition.csv'))
print(type(df))
df.head()

<class 'pandas.core.frame.DataFrame'>


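|  | Age | Attrition | BusinessTravel | DailyRate | DistanceFromHome | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 41 | Yes | Travel_Rarely | 1102 | 1 | 2 | Female | 94 | 3 | 2 | ... | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | 8 | 3 | Male | 61 | 2 | 2 | ... | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | 2 | 4 | Male | 92 | 2 | 1 | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | 3 | 4 | Female | 56 | 3 | 1 | ... | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | 2 | 1 | Male | 40 | 3 | 1 | ... | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |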


## Selecting columns with `[]`

Use column names to select one or more columns from a `DataFrame`.

In [20]:
age = df["Age"]

print(type(age))
age.head()

<class 'pandas.core.series.Series'>


0    41
1    49
2    37
3    33
4    27
Name: Age, dtype: int64

Pass a `list` of column names to select multiple columns:

In [21]:
sub_df = df[["Age", "Attrition", "DailyRate"]]

print(type(sub_df))
sub_df.head()

<class 'pandas.core.frame.DataFrame'>


|  | Age | Attrition | DailyRate |
| --- | --- | --- | --- |
| 0 | 41 | Yes | 1102 |
| 1 | 49 | No | 279 |
| 2 | 37 | Yes | 1373 |
| 3 | 33 | No | 1392 |
| 4 | 27 | No | 591 |
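
One detail that is easy to miss: a bare column name returns a `Series`, while a list returns a `DataFrame`, even when the list holds only one name. The following is a minimal sketch of that difference; the `people` frame is a small made-up stand-in, not the employee data set itself.

```python
import pandas as pd

# Hypothetical stand-in for the employee data used above
people = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
})

as_series = people["Age"]     # bare string -> Series
as_frame = people[["Age"]]    # one-element list -> DataFrame with one column

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(as_frame.shape)             # (3, 1)
```

The one-column `DataFrame` form is handy when later code expects a two-dimensional object.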


Get creative with Python list comprehensions to select columns dynamically.

In [22]:
year_data = df[[col for col in df.columns if col.startswith('Years')]]

year_data.head()


|  | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- |
| 0 | 6 | 4 | 0 | 5 |
| 1 | 10 | 7 | 1 | 7 |
| 2 | 0 | 0 | 0 | 0 |
| 3 | 8 | 7 | 3 | 0 |
| 4 | 2 | 2 | 2 | 2 |
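
The same prefix-based selection can probably also be written with pandas' own `DataFrame.filter` method. The sketch below compares the two approaches on a small hypothetical frame (`sample` and its values are invented for illustration; only the column names are borrowed from the data set).

```python
import pandas as pd

# Hypothetical frame reusing a few column names from the data set
sample = pd.DataFrame({
    "Age": [41, 49],
    "YearsAtCompany": [6, 10],
    "YearsInCurrentRole": [4, 7],
})

# List comprehension over the column labels, as in the cell above
by_comprehension = sample[[col for col in sample.columns if col.startswith("Years")]]

# DataFrame.filter with a regular expression anchored at the start of the name
by_filter = sample.filter(regex="^Years")

print(list(by_filter.columns))   # ['YearsAtCompany', 'YearsInCurrentRole']
```

Both forms return the same columns here; the comprehension is more flexible when the selection rule is not a simple pattern.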


## Selecting data with `loc[]`

The `loc` attribute on a `DataFrame` provides label-based access to data. In most cases, rows are labeled with the default row index, integers that start at 0 and increment by 1 for each row, and columns are labeled with descriptive strings. So, when specifying a row we will use an integer value, and when specifying a column we will use a descriptive string like 'DistanceFromHome'.

You always access `loc` with `[]`. It accepts several types of input, and the structure of what it returns depends on the input you give it.


### Individual row selection with `loc[row]`
Pass an integer row label to select a single row of data:

In [81]:
row = df.loc[17]

print(type(row))
row.head()

<class 'pandas.core.series.Series'>


Age                         22
Attrition                   No
BusinessTravel      Non-Travel
DailyRate                 1123
DistanceFromHome            16
Name: 17, dtype: object

### Select many rows with `loc[start:finish]`
Pass a slice of integer row labels to select more than one row.
Notice that both endpoints of the slice are included in the result, unlike slicing a Python `list`, where the stop value is excluded.

In [82]:
rows = df.loc[12:15]

print(type(rows))
rows

<class 'pandas.core.frame.DataFrame'>


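|  | Age | Attrition | BusinessTravel | DailyRate | DistanceFromHome | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | 31 | No | Travel_Rarely | 670 | 26 | 1 | Male | 31 | 3 | 1 | ... | 3 | 4 | 1 | 5 | 1 | 2 | 5 | 2 | 4 | 3 |
| 13 | 34 | No | Travel_Rarely | 1346 | 19 | 2 | Male | 93 | 3 | 1 | ... | 3 | 3 | 1 | 3 | 2 | 3 | 2 | 2 | 1 | 2 |
| 14 | 28 | Yes | Travel_Rarely | 103 | 24 | 3 | Male | 50 | 2 | 1 | ... | 3 | 2 | 0 | 6 | 4 | 3 | 4 | 2 | 0 | 3 |
| 15 | 29 | No | Travel_Rarely | 1389 | 21 | 2 | Female | 51 | 4 | 3 | ... | 3 | 3 | 1 | 10 | 1 | 3 | 10 | 9 | 8 | 8 |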
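
To make the endpoint behavior concrete, here is a small side-by-side sketch of slicing a Python `list` and slicing with `loc`. The `numbers` list and `frame` DataFrame are made up for the comparison.

```python
import pandas as pd

numbers = list(range(20))
frame = pd.DataFrame({"value": numbers})   # default integer row labels 0..19

list_slice = numbers[12:15]     # positional slice: the stop position is excluded
loc_slice = frame.loc[12:15]    # label slice: the stop label is included

print(list_slice)                 # [12, 13, 14]
print(loc_slice.index.tolist())   # [12, 13, 14, 15]
```

The difference exists because `loc` works with labels rather than positions, and for labels in general there is no obvious "one past the end" value.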


### Select a single value with `loc[row, column]`
`loc[]` also accepts a second parameter that specifies the desired columns. If we pass a single label for both parameters, we select a single scalar value from the `DataFrame`:

In [83]:
age_of_employee_17 = df.loc[17, 'Age']

print(type(age_of_employee_17))
age_of_employee_17

<class 'numpy.int64'>


22

### Selecting multiple rows and multiple columns with `loc[rows, columns]`

Either parameter accepts a single label, a list of labels, or a slice of labels, so rows and columns can be narrowed down in one step.

In [84]:
df.loc[25:30, 'Age':'EnvironmentSatisfaction']

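|  | Age | Attrition | BusinessTravel | DailyRate | DistanceFromHome | EnvironmentSatisfaction |
| --- | --- | --- | --- | --- | --- | --- |
| 25 | 53 | No | Travel_Rarely | 1282 | 5 | 3 |
| 26 | 32 | Yes | Travel_Frequently | 1125 | 16 | 2 |
| 27 | 42 | No | Travel_Rarely | 691 | 8 | 3 |
| 28 | 44 | No | Travel_Rarely | 477 | 7 | 1 |
| 29 | 46 | No | Travel_Rarely | 705 | 2 | 2 |
| 30 | 33 | No | Travel_Rarely | 924 | 2 | 3 |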


In [85]:
df.loc[[13, 75, 22, 11], ['Age', 'Attrition', 'HourlyRate']]

|  | Age | Attrition | HourlyRate |
| --- | --- | --- | --- |
| 13 | 34 | No | 93 |
| 75 | 31 | No | 61 |
| 22 | 34 | No | 53 |
| 11 | 29 | No | 49 |
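
The row and column selectors can also be mixed freely. The sketch below, on a small hypothetical `staff` frame (values invented, column names borrowed from the data set), shows a bare `:` for "all rows", a list combined with a slice, and a single row combined with a list of columns.

```python
import pandas as pd

# Hypothetical miniature of the employee data
staff = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "HourlyRate": [94, 61, 92, 56],
})

# A bare colon is a slice that selects every row
all_rows = staff.loc[:, ["Age", "HourlyRate"]]

# A list of row labels combined with a slice of column labels;
# rows come back in the order the labels were listed
mixed = staff.loc[[3, 0], "Age":"Attrition"]

# A single row label with a list of columns returns a Series
one_row = staff.loc[2, ["Age", "HourlyRate"]]

print(all_rows.shape)             # (4, 2)
print(mixed.index.tolist())       # [3, 0]
print(type(one_row).__name__)     # Series
```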


### Selecting rows with a condition

Just to keep you on your toes, `loc[]` can also accept a _completely different_ type of input. If you pass a list-like collection of Boolean values, it will return a DataFrame containing only the rows that correspond to the `True` values in the input.


Here's a simple example to demonstrate:

In [61]:
example = pd.DataFrame({
    'Name': ['Joe', 'Alice', 'Steve', 'Jennie'],
    'Age': [33, 39, 22, 42]
})

over_30 = example.loc[[True, True, False, True]]

print(type(over_30))
over_30

<class 'pandas.core.frame.DataFrame'>


|  | Name | Age |
| --- | --- | --- |
| 0 | Joe | 33 |
| 1 | Alice | 39 |
| 3 | Jennie | 42 |


Okay, okay, that was silly.  We're not going to be constructing lists of Boolean values manually. The above example would be much simpler as `example.loc[[0,1,3]]`.  However, consider the following code:

In [62]:
is_over_30 = example.Age > 30

print(type(is_over_30))
is_over_30

<class 'pandas.core.series.Series'>


0     True
1     True
2    False
3     True
Name: Age, dtype: bool

Now we can write this nicely readable bit of code, which will select all rows where the Age column is greater than 30.

In [63]:
over_30 = example.loc[example.Age > 30]

print(type(over_30))
over_30

<class 'pandas.core.frame.DataFrame'>


|  | Name | Age |
| --- | --- | --- |
| 0 | Joe | 33 |
| 1 | Alice | 39 |
| 3 | Jennie | 42 |
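
A Boolean condition can be combined with the column parameter of `loc[]` described earlier. As a minimal sketch, reusing the same `example` data, this selects only the `Name` column for the rows that satisfy the condition:

```python
import pandas as pd

example = pd.DataFrame({
    "Name": ["Joe", "Alice", "Steve", "Jennie"],
    "Age": [33, 39, 22, 42],
})

# Boolean Series as the row selector, a column label as the column selector
names_over_30 = example.loc[example["Age"] > 30, "Name"]

print(type(names_over_30).__name__)   # Series
print(names_over_30.tolist())         # ['Joe', 'Alice', 'Jennie']
```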


This becomes much more powerful as we deal with larger data sets:

In [64]:
print(df.shape)
filtered = df
filtered = filtered.loc[filtered.Age > 30] # only older than 30

filtered = filtered.loc[filtered.Age <= 40] # only 40 and younger
filtered = filtered.loc[filtered.BusinessTravel == 'Travel_Rarely'] # only 'Travel_Rarely'

print(filtered.shape)  # notice the row count goes from 1470 to 423
filtered

(1470, 26)
(423, 26)


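|  | Age | Attrition | BusinessTravel | DailyRate | DistanceFromHome | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 37 | Yes | Travel_Rarely | 1373 | 2 | 4 | Male | 92 | 2 | 1 | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 9 | 36 | No | Travel_Rarely | 1299 | 27 | 3 | Male | 94 | 3 | 2 | ... | 3 | 2 | 2 | 17 | 3 | 2 | 7 | 7 | 7 | 7 |
| 10 | 35 | No | Travel_Rarely | 809 | 16 | 1 | Male | 84 | 4 | 1 | ... | 3 | 3 | 1 | 6 | 5 | 3 | 5 | 4 | 0 | 3 |
| 12 | 31 | No | Travel_Rarely | 670 | 26 | 1 | Male | 31 | 3 | 1 | ... | 3 | 4 | 1 | 5 | 1 | 2 | 5 | 2 | 4 | 3 |
| 13 | 34 | No | Travel_Rarely | 1346 | 19 | 2 | Male | 93 | 3 | 1 | ... | 3 | 3 | 1 | 3 | 2 | 3 | 2 | 2 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1457 | 40 | No | Travel_Rarely | 1194 | 2 | 3 | Female | 98 | 3 | 1 | ... | 3 | 2 | 3 | 20 | 2 | 3 | 5 | 3 | 0 | 2 |
| 1458 | 35 | No | Travel_Rarely | 287 | 1 | 3 | Female | 62 | 1 | 1 | ... | 3 | 4 | 1 | 4 | 5 | 3 | 4 | 3 | 1 | 1 |
| 1462 | 39 | No | Travel_Rarely | 722 | 24 | 2 | Female | 60 | 2 | 4 | ... | 3 | 1 | 1 | 21 | 2 | 2 | 20 | 9 | 9 | 6 |
| 1466 | 39 | No | Travel_Rarely | 613 | 6 | 4 | Male | 42 | 2 | 3 | ... | 3 | 1 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |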
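
The step-by-step filtering above can also be expressed as a single `loc[]` call by combining the conditions with `&`. The sketch below uses a tiny hypothetical `staff` frame (invented values) rather than the full data set, and checks that both styles select the same rows.

```python
import pandas as pd

# Hypothetical miniature of the employee data
staff = pd.DataFrame({
    "Age": [41, 37, 33, 27, 36],
    "BusinessTravel": [
        "Travel_Rarely", "Travel_Rarely", "Travel_Frequently",
        "Travel_Rarely", "Travel_Rarely",
    ],
})

# Step-by-step filtering, as in the cell above
stepwise = staff
stepwise = stepwise.loc[stepwise.Age > 30]
stepwise = stepwise.loc[stepwise.Age <= 40]
stepwise = stepwise.loc[stepwise.BusinessTravel == "Travel_Rarely"]

# The same conditions combined with & (element-wise "and").
# Each comparison needs its own parentheses because & binds
# more tightly than > or == in Python.
combined = staff.loc[
    (staff.Age > 30)
    & (staff.Age <= 40)
    & (staff.BusinessTravel == "Travel_Rarely")
]

print(combined.index.tolist())     # [1, 4]
print(stepwise.equals(combined))   # True
```

Which style reads better is a matter of taste: the stepwise version makes it easy to comment on each condition, while the combined version builds a single mask and avoids the intermediate frames.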


## A moment for `iloc[]`

Alongside the `loc[]` attribute, DataFrames also have `iloc[]`. This attribute is analogous to `loc`, but it operates strictly on integer positions for locating rows and columns. Note that we used integer values for the rows earlier with `loc[]`, but only because the row index of that DataFrame happened to hold integer labels. This example uses string labels to demonstrate the difference between `loc` and `iloc`.

In [78]:
df2 = pd.DataFrame({
    'Name': ['Joe', 'Alice', 'Steve', 'Jennie'],
    'Age': [33, 39, 22, 42]
}, index=list('abcd'))  # a flat list of labels; wrapping it in another list would build a MultiIndex

df2

|  | Name | Age |
| --- | --- | --- |
| a | Joe | 33 |
| b | Alice | 39 |
| c | Steve | 22 |
| d | Jennie | 42 |


Notice we now have string values for our row index. This changes how we have to use `loc`:

In [79]:
# df2.loc[[0, 1, 2]] would raise a KeyError, because those values aren't present in the row index

df2.loc[['a', 'c', 'd']]

|  | Name | Age |
| --- | --- | --- |
| a | Joe | 33 |
| c | Steve | 22 |
| d | Jennie | 42 |


However, with `iloc`, we can always use integers to retrieve rows and columns by their position:

In [80]:
df2.iloc[[0,2,3], [1]]

|  | Age |
| --- | --- |
| a | 33 |
| c | 22 |
| d | 42 |
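
Two further differences between `loc` and `iloc` are worth a quick sketch. First, slices behave differently: `loc` includes the stop label, while `iloc` excludes the stop position, like ordinary Python slicing. Second, even with integer labels the two can diverge, for example after filtering, because the surviving rows keep their original labels. The frames below are small made-up examples.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Joe", "Alice", "Steve", "Jennie"],
    "Age": [33, 39, 22, 42],
}, index=list("abcd"))

by_label = df2.loc["a":"c"]     # label slice: 'c' is included
by_position = df2.iloc[0:2]     # position slice: position 2 is excluded

print(by_label.index.tolist())      # ['a', 'b', 'c']
print(by_position.index.tolist())   # ['a', 'b']

# With default integer labels, filtering keeps the original labels,
# while positions always restart at 0
example = pd.DataFrame({"Age": [33, 39, 22, 42]})
over_35 = example.loc[example["Age"] > 35]

print(over_35.index.tolist())   # [1, 3]
print(over_35.iloc[0, 0])       # 39 (first remaining row, by position)
print(over_35.loc[3, "Age"])    # 42 (row labeled 3)
```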
