# Indexing and Selecting Data from DataFrames

## Outline

* Selecting Data from DataFrames
* Our data
* Selecting Columns with `[]`
* Selecting Data with `loc[]`
* A Moment for `iloc[]`

## Selecting Data from DataFrames

If you're familiar with the idea of a SQL table, or even an Excel spreadsheet, then you know that you will want to be able to select data in various ways.  Maybe you'd like to select certain rows, or perhaps certain columns. Maybe a combination of both.  `DataFrame` can do it all.

We will explore basic column selection using the DataFrame object's `[]` operator, and then we will have an in-depth look at the `DataFrame.loc[]` attribute which provides a powerful variety of ways to access portions of data in a `DataFrame`.

## Our data

Let's load our data from a .csv file. We end up with a `DataFrame` of employee records.

In [None]:
import pandas as pd
from pathlib import Path

df = pd.read_csv(Path('data/employee_attrition.csv'))

In [38]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


In [39]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


## Selecting columns with `[]`

Use column names to select one or more columns from a `DataFrame`.

In [42]:
age = df[["Age","Attrition"]]

print(type(age))
age.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Age,Attrition
0,41,Yes
1,49,No
2,37,Yes
3,33,No
4,27,No
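The cell above passes a list of names, so the result is a `DataFrame`. As a minimal sketch of the other case, the block below uses a small made-up frame (`people` is a hypothetical stand-in, not the employee data) to show that a bare column name returns a `Series`, while a one-element list still returns a `DataFrame`.

```python
import pandas as pd

# Hypothetical stand-in for the employee data used above
people = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
})

single = people["Age"]        # a bare column name returns a Series
as_frame = people[["Age"]]    # a one-element list returns a DataFrame

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
print(as_frame.shape)           # (3, 1)
```

The double brackets are not special syntax: the outer pair is the `[]` operator and the inner pair is an ordinary Python list.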


Pass a `list` of column names to select multiple columns:

In [13]:
sub_df = df[["Age", "Attrition", "DailyRate"]]

print(type(sub_df))
sub_df.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Age,Attrition,DailyRate
0,41,Yes,1102
1,49,No,279
2,37,Yes,1373
3,33,No,1392
4,27,No,591


Get creative with Python list comprehensions to select columns dynamically.

In [58]:
year_data = df[[col for col in df.columns if col.startswith('Years')]]

year_data.head()


Unnamed: 0,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,6,4,0,5
1,10,7,1,7
2,0,0,0,0
3,8,7,3,0
4,2,2,2,2


In [64]:
# Using list comprehension
df[[c for c in df.columns if c.endswith('Satisfaction')]]

Unnamed: 0,EnvironmentSatisfaction,JobSatisfaction,RelationshipSatisfaction
0,2,4,1
1,3,2,4
2,4,3,2
3,4,3,3
4,1,2,4
...,...,...,...
1465,3,4,3
1466,4,1,1
1467,2,2,2
1468,4,2,4
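List comprehensions are one way to pick columns by name pattern. As a possible alternative, `DataFrame.filter` can match column labels by substring (`like=`) or regular expression (`regex=`). The sketch below uses a small hypothetical frame whose column names only mimic the pattern in the employee data.

```python
import pandas as pd

# Hypothetical columns that mimic the naming pattern in the employee data
frame = pd.DataFrame({
    "Age": [41, 49],
    "YearsAtCompany": [6, 10],
    "YearsInCurrentRole": [4, 7],
    "JobSatisfaction": [4, 2],
})

# regex="^Years" keeps columns whose label starts with "Years"
years = frame.filter(regex="^Years")

# like="Satisfaction" keeps columns whose label contains that substring
satisfaction = frame.filter(like="Satisfaction")

print(list(years.columns))         # ['YearsAtCompany', 'YearsInCurrentRole']
print(list(satisfaction.columns))  # ['JobSatisfaction']
```

Note that `like=` matches anywhere in the label, so it is slightly looser than `endswith`.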


## Selecting data with `loc[]`

The `loc` attribute on a `DataFrame` provides label-based access to data. By default, rows are labeled with integers that start at 0 and increase by 1 for each row, and columns are labeled with descriptive strings. So when specifying a row we use an integer, and when specifying a column we use a string such as 'DistanceFromHome'.

You always access `loc` with `[]`, but it accepts several types of input and returns a differently structured result for each.


### Individual row selection with `loc[row]`
Simply pass an integer as a row label to select a row of data:

In [73]:
row = df.loc[17]

print(type(row))
row.head()

<class 'pandas.core.series.Series'>


Age                                   22
Attrition                             No
BusinessTravel                Non-Travel
DailyRate                           1123
Department        Research & Development
Name: 17, dtype: object

In [77]:
df.loc[[1,8],['Age']]

Unnamed: 0,Age
1,49
8,38


### Select many rows with `loc[start:finish]`
Pass a slice of integer row labels to select more than one row.
Notice that both endpoints of the slice are included in the result, unlike slicing a Python `list`, where the end value is excluded.

In [78]:
rows = df.loc[12:15]

print(type(rows))
rows

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
12,31,No,Travel_Rarely,670,Research & Development,26,1,Life Sciences,1,16,...,4,80,1,5,1,2,5,2,4,3
13,34,No,Travel_Rarely,1346,Research & Development,19,2,Medical,1,18,...,3,80,1,3,2,3,2,2,1,2
14,28,Yes,Travel_Rarely,103,Research & Development,24,3,Life Sciences,1,19,...,2,80,0,6,4,3,4,2,0,3
15,29,No,Travel_Rarely,1389,Research & Development,21,4,Life Sciences,1,20,...,3,80,1,10,1,3,10,9,8,8
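To make the endpoint behavior concrete, here is a minimal sketch comparing the same `12:15` slice on a Python list, on `loc`, and on `iloc`. The `numbers` list and the `value` column are made up for illustration.

```python
import pandas as pd

numbers = list(range(20))
frame = pd.DataFrame({"value": numbers})  # default integer labels 0..19

list_slice = numbers[12:15]                       # end excluded
loc_slice = frame.loc[12:15, "value"].tolist()    # end label included
iloc_slice = frame.iloc[12:15]["value"].tolist()  # positional, end excluded

print(list_slice)  # [12, 13, 14]
print(loc_slice)   # [12, 13, 14, 15]
print(iloc_slice)  # [12, 13, 14]
```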


### Select a single value with `loc[row, column]`
`loc[]` also accepts a second argument that specifies the desired columns. If we pass a single label for both arguments, we select a single scalar value from the `DataFrame`:

In [79]:
age_of_employee_17 = df.loc[17, 'Age']

print(type(age_of_employee_17))
age_of_employee_17

<class 'numpy.int64'>


22

### Selecting multiple rows and multiple columns using `loc[rows, columns]`

Great, now we can use a single label, a list of labels, or a slice for either the rows or the columns argument. As with rows, a slice of column labels includes both endpoints.

In [80]:
df.loc[25:30, 'Age':'EnvironmentSatisfaction']

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction
25,53,No,Travel_Rarely,1282,Research & Development,5,3,Other,1,32,3
26,32,Yes,Travel_Frequently,1125,Research & Development,16,1,Life Sciences,1,33,2
27,42,No,Travel_Rarely,691,Sales,8,4,Marketing,1,35,3
28,44,No,Travel_Rarely,477,Research & Development,7,4,Medical,1,36,1
29,46,No,Travel_Rarely,705,Sales,2,4,Marketing,1,38,2
30,33,No,Travel_Rarely,924,Research & Development,2,3,Medical,1,39,3


In [82]:
df.loc[1:10, 'Age':'DailyRate']

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate
1,49,No,Travel_Frequently,279
2,37,Yes,Travel_Rarely,1373
3,33,No,Travel_Frequently,1392
4,27,No,Travel_Rarely,591
5,32,No,Travel_Frequently,1005
6,59,No,Travel_Rarely,1324
7,30,No,Travel_Rarely,1358
8,38,No,Travel_Frequently,216
9,36,No,Travel_Rarely,1299
10,35,No,Travel_Rarely,809


In [83]:
# Rows and columns come back in the order you list them
df.loc[[13, 75, 22, 11], ['Age', 'Attrition', 'HourlyRate']]

Unnamed: 0,Age,Attrition,HourlyRate
13,34,No,93
75,31,No,61
22,34,No,53
11,29,No,49
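The shape of the result depends on what you pass for each argument. The sketch below, on a small hypothetical frame, walks through the combinations of single labels, lists, and slices and notes the type that comes back for each.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
frame = pd.DataFrame(
    {"Age": [41, 49, 37], "Attrition": ["Yes", "No", "Yes"]}
)

scalar = frame.loc[1, "Age"]              # label, label -> single value
row = frame.loc[1, ["Age", "Attrition"]]  # label, list  -> Series (one row)
column = frame.loc[0:1, "Age"]            # slice, label -> Series (one column)
block = frame.loc[[0, 2], ["Age"]]        # list, list   -> DataFrame

print(scalar)                 # 49
print(type(row).__name__)     # Series
print(type(column).__name__)  # Series
print(block.shape)            # (2, 1)
```

A rough rule of thumb: each single label drops one dimension, while a list or slice keeps it.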


In [111]:
r = [['Joe', 33],['Alice', 39], ['Steve',22],['Jennie',42]]
e = pd.DataFrame(r, columns="Name Age".split())
e

Unnamed: 0,Name,Age
0,Joe,33
1,Alice,39
2,Steve,22
3,Jennie,42


In [113]:
e[e['Age']>40]

Unnamed: 0,Name,Age
3,Jennie,42


### Selecting rows with a condition

Just to keep you on your toes, `loc[]` can also accept a _completely different_ type of input. If you pass a list-like collection of Boolean values, it returns a DataFrame containing the rows that correspond to the `True` values in the input.


Here's a simple example to demonstrate:

In [114]:
example = pd.DataFrame({
    'Name': ['Joe', 'Alice', 'Steve', 'Jennie'],
    'Age': [33, 39, 22, 42]
})

over_30 = example.loc[[True, True, False, True]]

print(type(over_30))
over_30

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Name,Age
0,Joe,33
1,Alice,39
3,Jennie,42


Okay, okay, that was silly.  We're not going to be constructing lists of Boolean values manually. The above example would be much simpler as `example.loc[[0,1,3]]`.  However, consider the following code:

In [115]:
is_over_30 = example.Age > 30

print(type(is_over_30))
is_over_30

<class 'pandas.core.series.Series'>


0     True
1     True
2    False
3     True
Name: Age, dtype: bool

Now we can write this nicely readable bit of code, which will select all rows where the Age column is greater than 30.

In [116]:
over_30 = example.loc[example.Age > 30]

print(type(over_30))
over_30

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Name,Age
0,Joe,33
1,Alice,39
3,Jennie,42


This becomes much more powerful as we deal with larger data sets:

In [122]:
print(df.shape)
filtered = df
filtered = filtered.loc[filtered.Age > 30] # only older than 30

filtered = filtered.loc[filtered.Age <= 40] # only 40 and younger
filtered = filtered.loc[filtered.BusinessTravel == 'Travel_Rarely'] # only 'Travel_Rarely'

print(filtered.shape)  # notice the row count goes from 1470 to 423
filtered

(1470, 35)
(423, 35)


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
9,36,No,Travel_Rarely,1299,Research & Development,27,3,Medical,1,13,...,2,80,2,17,3,2,7,7,7,7
10,35,No,Travel_Rarely,809,Research & Development,16,3,Medical,1,14,...,3,80,1,6,5,3,5,4,0,3
12,31,No,Travel_Rarely,670,Research & Development,26,1,Life Sciences,1,16,...,4,80,1,5,1,2,5,2,4,3
13,34,No,Travel_Rarely,1346,Research & Development,19,2,Medical,1,18,...,3,80,1,3,2,3,2,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1457,40,No,Travel_Rarely,1194,Research & Development,2,4,Medical,1,2051,...,2,80,3,20,2,3,5,3,0,2
1458,35,No,Travel_Rarely,287,Research & Development,1,4,Life Sciences,1,2052,...,4,80,1,4,5,3,4,3,1,1
1462,39,No,Travel_Rarely,722,Sales,24,1,Marketing,1,2056,...,1,80,1,21,2,2,20,9,9,6
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,80,1,9,5,3,7,7,1,7


In [141]:
#df.shape
df[(df['Age'] > 30) &
   (df['Age'] <= 40) &
   (df['BusinessTravel'] == 'Travel_Rarely')]

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
9,36,No,Travel_Rarely,1299,Research & Development,27,3,Medical,1,13,...,2,80,2,17,3,2,7,7,7,7
10,35,No,Travel_Rarely,809,Research & Development,16,3,Medical,1,14,...,3,80,1,6,5,3,5,4,0,3
12,31,No,Travel_Rarely,670,Research & Development,26,1,Life Sciences,1,16,...,4,80,1,5,1,2,5,2,4,3
13,34,No,Travel_Rarely,1346,Research & Development,19,2,Medical,1,18,...,3,80,1,3,2,3,2,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1457,40,No,Travel_Rarely,1194,Research & Development,2,4,Medical,1,2051,...,2,80,3,20,2,3,5,3,0,2
1458,35,No,Travel_Rarely,287,Research & Development,1,4,Life Sciences,1,2052,...,4,80,1,4,5,3,4,3,1,1
1462,39,No,Travel_Rarely,722,Sales,24,1,Marketing,1,2056,...,1,80,1,21,2,2,20,9,9,6
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,80,1,9,5,3,7,7,1,7
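The same filter can be written in a few equivalent ways. The sketch below, on a made-up `staff` frame, compares the explicit mask with `Series.between` and `DataFrame.query`. It assumes whole-number ages, so `Age > 30` and `Age <= 40` is the same as `between(31, 40)`, which includes both endpoints by default.

```python
import pandas as pd

# Hypothetical data; only the column names follow the employee data
staff = pd.DataFrame({
    "Age": [29, 31, 40, 41, 35],
    "BusinessTravel": ["Travel_Rarely", "Travel_Rarely", "Travel_Rarely",
                       "Travel_Rarely", "Non-Travel"],
})

travels_rarely = staff["BusinessTravel"] == "Travel_Rarely"

# Explicit mask, as in the cells above
by_mask = staff.loc[(staff["Age"] > 30) & (staff["Age"] <= 40) & travels_rarely]

# between() includes both endpoints by default (assumes integer ages)
by_between = staff.loc[staff["Age"].between(31, 40) & travels_rarely]

# query() takes the condition as a string expression
by_query = staff.query(
    "Age > 30 and Age <= 40 and BusinessTravel == 'Travel_Rarely'"
)

print(by_mask.index.tolist())     # [1, 2]
print(by_between.index.tolist())  # [1, 2]
print(by_query.index.tolist())    # [1, 2]
```

Which form reads best is a matter of taste; the explicit mask is the most general, since any Boolean `Series` can be combined into it.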


## A moment for `iloc[]`

Alongside the `loc[]` attribute, DataFrames also have `iloc[]`. This attribute is analogous to `loc`, but it operates strictly on integer positions for locating rows and columns. Note that we used integer values for the rows earlier with `loc[]`, but only because they matched the labels in the DataFrame's row index. This example uses string labels to demonstrate the difference between `loc` and `iloc`.

In [138]:
df2 = pd.DataFrame({
    'Name': ['Joe', 'Alice', 'Steve', 'Jennie'],
    'Age': [33, 39, 22, 42]
}, index=list('abcd'))  # a flat list of labels; wrapping it in another list would build a MultiIndex

df2

Unnamed: 0,Name,Age
a,Joe,33
b,Alice,39
c,Steve,22
d,Jennie,42


Notice we now have string values for our row index. This changes how we have to use `loc`:

In [139]:
# df2.loc[[0, 1, 2]] would raise a KeyError, because those values aren't present in the row index

df2.loc[['a', 'c', 'd']]

Unnamed: 0,Name,Age
a,Joe,33
c,Steve,22
d,Jennie,42


However, with `iloc`, we can always use integers to retrieve rows and columns by their position:

In [140]:
df2.iloc[[0,2,3], [1]]

Unnamed: 0,Age
a,33
c,22
d,42
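The difference also matters with integer labels. After filtering, the surviving rows keep their original labels, so label and position no longer line up. A minimal sketch, using a hypothetical `people` frame with the same values as the earlier example:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Joe", "Alice", "Steve", "Jennie"],
    "Age": [33, 39, 22, 42],
})

over_30 = people.loc[people["Age"] > 30]  # keeps labels 0, 1, 3

by_position = over_30.iloc[2]["Name"]  # third row by position
by_label = over_30.loc[3, "Name"]      # row whose label is 3

print(over_30.index.tolist())  # [0, 1, 3]
print(by_position)             # Jennie
print(by_label)                # Jennie
# over_30.loc[2] would raise a KeyError, because label 2 was filtered out
```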


In [170]:
dc = {"Name": "John Alex Sarah Jessica".split(),
      "Movie": "TopGun Titanic LOTR HarryPotter".split(),
      "Age": [23, 26, 27, 21]}

In [175]:
myd = pd.DataFrame(dc)

In [176]:
myd[['Movie']]

Unnamed: 0,Movie
0,TopGun
1,Titanic
2,LOTR
3,HarryPotter


In [177]:
myd

Unnamed: 0,Name,Movie,Age
0,John,TopGun,23
1,Alex,Titanic,26
2,Sarah,LOTR,27
3,Jessica,HarryPotter,21


In [181]:
myd.loc[2]

Name     Sarah
Movie     LOTR
Age         27
Name: 2, dtype: object

In [183]:
myd.iloc[2]

Name     Sarah
Movie     LOTR
Age         27
Name: 2, dtype: object

In [189]:
(myd['Age']>23) & (myd['Age'] < 5500)

0    False
1     True
2     True
3    False
Name: Age, dtype: bool

In [195]:
c = {"Name":["John","Joe","Alan","Steve"],
    "Salary":[2000, 4000, 4500, 6000]}

In [196]:
df = pd.DataFrame(c)
df

Unnamed: 0,Name,Salary
0,John,2000
1,Joe,4000
2,Alan,4500
3,Steve,6000


In [200]:
# Python has no && operator, so this line does not even parse
df.Salary > 3000 && df.Salary < 5500

SyntaxError: invalid syntax (<ipython-input-200-cb500a3f0d64>, line 1)

In [205]:
# Still a syntax error because of &&; the column name would also need quotes: df['Salary']
df[Salary] > 3000 && df[Salary] < 5500

SyntaxError: invalid syntax (<ipython-input-205-04d6d814602e>, line 1)

In [216]:
df[(df['Salary'] > 3000) & (df['Salary'] < 5500)]

Unnamed: 0,Name,Salary
1,Joe,4000
2,Alan,4500


In [217]:
df[(df.Salary > 3000) & (df.Salary < 5500)]

Unnamed: 0,Name,Salary
1,Joe,4000
2,Alan,4500
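The failed attempts above raise the question of why Python's own `and` is not the fix. As a rough sketch on a hypothetical `pay` frame with the same values: `and` asks each `Series` for a single truth value, which pandas refuses to give, while `&` compares element by element.

```python
import pandas as pd

pay = pd.DataFrame({
    "Name": ["John", "Joe", "Alan", "Steve"],
    "Salary": [2000, 4000, 4500, 6000],
})

# `and` needs one True/False per operand, so pandas raises ValueError
try:
    pay[(pay["Salary"] > 3000) and (pay["Salary"] < 5500)]
    raised = False
except ValueError:
    raised = True

# `&` works element by element; each comparison needs its own parentheses
# because & binds more tightly than > and <
in_range = pay[(pay["Salary"] > 3000) & (pay["Salary"] < 5500)]

print(raised)                     # True
print(in_range["Name"].tolist())  # ['Joe', 'Alan']
```

The same applies to `or` and `not`: use `|` and `~` on Boolean `Series` instead.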
