# Pandas - Indexing, Selecting & Assigning 

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

In [1]:
import pandas as pd

In [5]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

,Yes,No
0,50,131
1,21,2


In [3]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

,Bob,Sue
0,I liked it.,Pretty good.
1,It was awful.,Bland.


In [9]:
df = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

# Series
A Series is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list.

In [15]:
s = pd.Series([1, 2, 3, 4, 5])
print(type(s))

<class 'pandas.core.series.Series'>


A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

In [7]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
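To make the "single column of a DataFrame" idea concrete, here is a minimal sketch that rebuilds the reviews table from earlier (the variable names `reviews` and `bob` are only illustrative) and pulls one column out of it. The column comes back as a Series whose name is the column label and whose index is the DataFrame's row labels.

```python
import pandas as pd

# Small illustrative table, mirroring the reviews example above
reviews = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'],
)

bob = reviews['Bob']        # selecting one column returns a Series
print(type(bob).__name__)   # the class name of the selected object
print(bob.name)             # the Series name is the column label
print(list(bob.index))      # the row labels carry over as the Series index
```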

# Reading data files

To read data from a CSV file (here, the Employee Attrition dataset), use pd.read_csv:

In [20]:
employee_attrition = pd.read_csv("Employee-attrition for Module 1.csv")

We can use the shape attribute to check how large the resulting DataFrame is:

In [21]:
employee_attrition.shape

(49653, 18)

The shape tells us that this data has 49,653 records across 18 columns.

We can examine the contents of the resulting DataFrame using the head() method, which grabs the first five rows:

In [23]:
employee_attrition.head()

,EmployeeID,recorddate_key,birthdate_key,orighiredate_key,terminationdate_key,age,length_of_service,city_name,department_name,job_title,store_name,gender_short,gender_full,termreason_desc,termtype_desc,STATUS_YEAR,STATUS,BUSINESS_UNIT
0,1318,12/31/2006 0:00,1/3/1954,8/28/1989,1/1/1900,52,17,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2006,ACTIVE,HEADOFFICE
1,1318,12/31/2007 0:00,1/3/1954,8/28/1989,1/1/1900,53,18,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2007,ACTIVE,HEADOFFICE
2,1318,12/31/2008 0:00,1/3/1954,8/28/1989,1/1/1900,54,19,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2008,ACTIVE,HEADOFFICE
3,1318,12/31/2009 0:00,1/3/1954,8/28/1989,1/1/1900,55,20,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2009,ACTIVE,HEADOFFICE
4,1318,12/31/2010 0:00,1/3/1954,8/28/1989,1/1/1900,56,21,Vancouver,Executive,CEO,35,M,Male,Not Applicable,Not Applicable,2010,ACTIVE,HEADOFFICE


Alternatively, you can use .tail() to find the last five rows of data.

In [24]:
employee_attrition.tail()

,EmployeeID,recorddate_key,birthdate_key,orighiredate_key,terminationdate_key,age,length_of_service,city_name,department_name,job_title,store_name,gender_short,gender_full,termreason_desc,termtype_desc,STATUS_YEAR,STATUS,BUSINESS_UNIT
49648,8258,12/1/2015 0:00,5/28/1994,8/19/2013,12/30/2015,21,2,Valemount,Dairy,Dairy Person,34,M,Male,Layoff,Involuntary,2015,TERMINATED,STORES
49649,8264,8/1/2013 0:00,6/13/1994,8/27/2013,8/30/2013,19,0,Vancouver,Customer Service,Cashier,44,F,Female,Resignaton,Voluntary,2013,TERMINATED,STORES
49650,8279,12/1/2015 0:00,7/18/1994,9/15/2013,12/30/2015,21,2,White Rock,Customer Service,Cashier,39,F,Female,Layoff,Involuntary,2015,TERMINATED,STORES
49651,8296,12/1/2013 0:00,9/2/1994,10/9/2013,12/31/2013,19,0,Kelowna,Customer Service,Cashier,16,F,Female,Resignaton,Voluntary,2013,TERMINATED,STORES
49652,8321,12/1/2014 0:00,11/28/1994,11/24/2013,12/30/2014,20,1,Grand Forks,Customer Service,Cashier,13,F,Female,Layoff,Involuntary,2014,TERMINATED,STORES


In this example data (employee attrition) we can see that each row has already been assigned an index number (starting from 0 and ending with 49652). If your file instead stores its own row labels in the first column, you can pass index_col=0 to pd.read_csv. This makes the first column of your data (column position 0) the index.
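The effect of index_col can be sketched without the real file. The example below assumes a tiny, made-up CSV held in a string (the `id` column and its values are hypothetical) and reads it twice, once with the default index and once with the first column promoted to the index.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column holds row labels
csv_text = """id,age,city_name
a1,52,Vancouver
a2,21,Valemount
"""

# Default behaviour: pandas adds its own 0, 1, 2, ... index
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the index instead of a data column
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.shape)   # 'id' is counted as an ordinary column
print(labelled.shape)        # one column fewer, because 'id' is now the index
print(list(labelled.index))
```

Note that the shape changes between the two reads, because the index is not counted as a column.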

# Indexing, Selecting & Assigning


### To select specific data within your DataFrame

Consider the DataFrame employee_attrition above. To access a specific column we can use attribute access (a dot followed by the column name), e.g. to select only the birthdate_key column:

In [26]:
employee_attrition.birthdate_key

0          1/3/1954
1          1/3/1954
2          1/3/1954
3          1/3/1954
4          1/3/1954
            ...    
49648     5/28/1994
49649     6/13/1994
49650     7/18/1994
49651      9/2/1994
49652    11/28/1994
Name: birthdate_key, Length: 49653, dtype: object

We can also use the indexing operator (square brackets), e.g.:

In [27]:
employee_attrition['birthdate_key']

0          1/3/1954
1          1/3/1954
2          1/3/1954
3          1/3/1954
4          1/3/1954
            ...    
49648     5/28/1994
49649     6/13/1994
49650     7/18/1994
49651      9/2/1994
49652    11/28/1994
Name: birthdate_key, Length: 49653, dtype: object
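The two forms return the same Series, but they are not interchangeable in every case. The sketch below uses a small made-up frame (the `job title` column, with a space in its name, is a hypothetical example) to show that brackets handle names that are not valid Python identifiers, and that a list of names inside the brackets returns a DataFrame rather than a Series.

```python
import pandas as pd

# Toy frame (hypothetical); 'job title' deliberately contains a space
staff = pd.DataFrame({'age': [52, 21], 'job title': ['CEO', 'Cashier']})

same = staff.age.equals(staff['age'])    # both forms return the same Series
titles = staff['job title']              # only brackets work for this name
two_cols = staff[['age', 'job title']]   # a list of names returns a DataFrame

print(same)
print(type(two_cols).__name__)
```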

### Selecting specific values from a DataFrame 

To drill down to a single specific value, we need only use the indexing operator [] once more. For example, to find the first birthdate in the DataFrame we can use:

In [28]:
employee_attrition['birthdate_key'][0]

'1/3/1954'
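One caveat worth sketching: the `[0]` here is looked up as a row label, and it only finds the first row because the default index happens to start at 0. The example below assumes a toy frame with integer labels 10 and 11 (hypothetical values) to show the lookup failing, along with two alternatives that say explicitly whether position or label is meant.

```python
import pandas as pd

# Toy frame (hypothetical) whose integer row labels do not start at 0
staff = pd.DataFrame(
    {'birthdate_key': ['1/3/1954', '5/28/1994']},
    index=[10, 11],
)

# [0] on the selected column looks for the *label* 0, which is absent here
try:
    staff['birthdate_key'][0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

first_value = staff['birthdate_key'].iloc[0]     # by position
labelled_value = staff.at[11, 'birthdate_key']   # by row label and column name

print(lookup_failed, first_value, labelled_value)
```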

## Indexing in pandas

## Index-based selection with iloc 

iloc selects data based on its numerical position in the DataFrame. To select the first row of data in employee_attrition we can use:

In [29]:
employee_attrition.iloc[0]

EmployeeID                        1318
recorddate_key         12/31/2006 0:00
birthdate_key                 1/3/1954
orighiredate_key             8/28/1989
terminationdate_key           1/1/1900
age                                 52
length_of_service                   17
city_name                    Vancouver
department_name              Executive
job_title                          CEO
store_name                          35
gender_short                         M
gender_full                       Male
termreason_desc         Not Applicable
termtype_desc           Not Applicable
STATUS_YEAR                       2006
STATUS                          ACTIVE
BUSINESS_UNIT               HEADOFFICE
Name: 0, dtype: object

We can see that this returns the values of all 18 columns for the first row, as a Series indexed by column name.
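The shape of what iloc returns depends on how the row is requested. As a rough illustration on a three-column toy frame (values borrowed from the rows shown above, the frame itself is hypothetical), a single integer gives a Series, while a one-element list keeps the result as a DataFrame.

```python
import pandas as pd

# Toy frame (hypothetical), using a few values from the output above
staff = pd.DataFrame({
    'EmployeeID': [1318, 8258],
    'age': [52, 21],
    'city_name': ['Vancouver', 'Valemount'],
})

row = staff.iloc[0]         # Series: one value per column
one_row = staff.iloc[[0]]   # DataFrame with a single row

print(list(row.index))      # the column names become the Series index
print(one_row.shape)
```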

#### NOTE: Both loc and iloc are row-first, column-second

For example, to retrieve all rows of only the first column (employee id), we would use:

In [31]:
employee_attrition.iloc[:,0]

0        1318
1        1318
2        1318
3        1318
4        1318
         ... 
49648    8258
49649    8264
49650    8279
49651    8296
49652    8321
Name: EmployeeID, Length: 49653, dtype: int64

#### On its own, the : operator means "everything". When combined with other selectors, however, it indicates a range of values. For example, to select the orighiredate_key column (column position 3) from just the first, second, and third rows, we would do:

In [32]:
employee_attrition.iloc[:3, 3]

0    8/28/1989
1    8/28/1989
2    8/28/1989
Name: orighiredate_key, dtype: object

Or, to select just the fifth and sixth entries:

In [34]:
employee_attrition.iloc[4:6, 3]

4    8/28/1989
5    8/28/1989
Name: orighiredate_key, dtype: object

Note: Because of zero-indexing, the fifth and sixth rows sit at positions 4 and 5. An iloc slice includes its start but excludes its end, so 4:6 selects positions 4 and 5 only.
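Because every value in the outputs above is the same date, the slices are hard to tell apart. The following sketch uses a made-up frame with distinct letters (purely illustrative data) so that the rows picked by each slice are visible.

```python
import pandas as pd

# Illustrative data with distinct values in every row
letters = pd.DataFrame({'letter': list('abcdefg'), 'n': range(7)})

first_three = letters.iloc[:3, 0]    # positions 0, 1, 2 of column 0
fifth_sixth = letters.iloc[4:6, 0]   # positions 4 and 5; position 6 is excluded

print(list(first_three))
print(list(fifth_sixth))
```

This is the same start-inclusive, end-exclusive convention that Python lists use for slicing.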

#### It's also possible to pass a list:

In [37]:
employee_attrition.iloc[[0, 1, 2, 3, 4], 3]

0    8/28/1989
1    8/28/1989
2    8/28/1989
3    8/28/1989
4    8/28/1989
Name: orighiredate_key, dtype: object
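A list does not have to be in order, and positions may repeat. The sketch below, on the same kind of made-up letters frame (hypothetical data), suggests that iloc returns rows in exactly the order the list asks for.

```python
import pandas as pd

# Illustrative data with distinct values in every row
letters = pd.DataFrame({'letter': list('abcdef')})

# Rows come back in list order, and a repeated position is returned twice
picked = letters.iloc[[4, 0, 0], 0]

print(list(picked))
print(list(picked.index))
```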

#### If you are looking for the last items in the dataset, you can use negative indexing, which counts back from the end (-1 is the last row). For example, to find the last five records in the data, we can use:

In [41]:
employee_attrition.iloc[-5:]

,EmployeeID,recorddate_key,birthdate_key,orighiredate_key,terminationdate_key,age,length_of_service,city_name,department_name,job_title,store_name,gender_short,gender_full,termreason_desc,termtype_desc,STATUS_YEAR,STATUS,BUSINESS_UNIT
49648,8258,12/1/2015 0:00,5/28/1994,8/19/2013,12/30/2015,21,2,Valemount,Dairy,Dairy Person,34,M,Male,Layoff,Involuntary,2015,TERMINATED,STORES
49649,8264,8/1/2013 0:00,6/13/1994,8/27/2013,8/30/2013,19,0,Vancouver,Customer Service,Cashier,44,F,Female,Resignaton,Voluntary,2013,TERMINATED,STORES
49650,8279,12/1/2015 0:00,7/18/1994,9/15/2013,12/30/2015,21,2,White Rock,Customer Service,Cashier,39,F,Female,Layoff,Involuntary,2015,TERMINATED,STORES
49651,8296,12/1/2013 0:00,9/2/1994,10/9/2013,12/31/2013,19,0,Kelowna,Customer Service,Cashier,16,F,Female,Resignaton,Voluntary,2013,TERMINATED,STORES
49652,8321,12/1/2014 0:00,11/28/1994,11/24/2013,12/30/2014,20,1,Grand Forks,Customer Service,Cashier,13,F,Female,Layoff,Involuntary,2014,TERMINATED,STORES
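As a small check of how negative positions behave, the sketch below builds a ten-row toy frame (hypothetical data) and compares a negative slice with tail(), which should select the same rows.

```python
import pandas as pd

# Ten-row toy frame; the values equal the row positions for easy reading
numbers = pd.DataFrame({'n': range(10)})

last_row = numbers.iloc[-1]      # the final row, as a Series
last_three = numbers.iloc[-3:]   # the final three rows, as a DataFrame

print(int(last_row['n']))
print(list(last_three['n']))
print(last_three.equals(numbers.tail(3)))
```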


## Label-based selection with loc

In this paradigm, it's the data index value, not its position, which matters.

For example, to get the birthdate_key value for the row labelled 0 in employee_attrition, we would now do the following:

In [42]:
employee_attrition.loc[0, 'birthdate_key']

'1/3/1954'
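With the default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between loc and iloc. The sketch below assumes a toy frame whose labels are deliberately out of order (hypothetical data) to separate the two, and also shows that a loc slice includes its end label while an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical frame whose index labels do not match row positions
shuffled = pd.DataFrame({'age': [52, 21, 19]}, index=[2, 0, 1])

by_position = shuffled.iloc[0, 0]    # first row as stored
by_label = shuffled.loc[0, 'age']    # the row whose label is 0

# Slices: iloc stops before the end position, loc includes the end label
default = pd.DataFrame({'age': [52, 53, 54, 55]})
n_iloc = len(default.iloc[0:2])      # positions 0 and 1
n_loc = len(default.loc[0:2])        # labels 0, 1 and 2

print(by_position, by_label)
print(n_iloc, n_loc)
```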

Since your dataset usually has meaningful row labels and column names, it's usually easier to do things using loc. For example, here's one operation that's much easier using loc (selecting several columns by name):

In [44]:
employee_attrition.loc[:, ['birthdate_key', 'age', 'length_of_service']]

,birthdate_key,age,length_of_service
0,1/3/1954,52,17
1,1/3/1954,53,18
2,1/3/1954,54,19
3,1/3/1954,55,20
4,1/3/1954,56,21
...,...,...,...
49648,5/28/1994,21,2
49649,6/13/1994,19,0
49650,7/18/1994,21,2
49651,9/2/1994,19,0
49652,11/28/1994,20,1
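To see why loc is the more convenient choice here, compare it with a position-based equivalent. This sketch uses a small made-up frame (hypothetical rows, with column names taken from the dataset) and looks up the column positions with `Index.get_indexer` before handing them to iloc.

```python
import pandas as pd

# Toy frame (hypothetical rows) with a subset of the dataset's column names
staff = pd.DataFrame({
    'EmployeeID': [1318, 8258],
    'birthdate_key': ['1/3/1954', '5/28/1994'],
    'age': [52, 21],
    'length_of_service': [17, 2],
})

cols = ['birthdate_key', 'age', 'length_of_service']

via_loc = staff.loc[:, cols]                  # select directly by name

positions = staff.columns.get_indexer(cols)   # translate names to positions
via_iloc = staff.iloc[:, positions]           # then select by position

print(list(positions))
print(via_loc.equals(via_iloc))
```

Both routes give the same result, but the loc version keeps working if the columns are later reordered.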


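In the rows shown, age and length of service rise together (the older the person, the longer their service). Note, however, that the first five rows are yearly records for the same employee (EmployeeID 1318), so both values necessarily increase by one each year; a correlation across employees would need to be checked on the full dataset.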
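One way the relationship between age and length of service might be checked numerically is with `Series.corr`. The sketch below is only an illustration on made-up rows (the IDs and values are hypothetical, not results from the real file); it keeps one record per employee first, so that repeated yearly snapshots of the same person do not inflate the figure.

```python
import pandas as pd

# Hypothetical rows: employee 1 appears in two yearly snapshots
snapshots = pd.DataFrame({
    'EmployeeID': [1, 1, 2, 3, 4],
    'STATUS_YEAR': [2006, 2007, 2007, 2007, 2007],
    'age': [52, 53, 21, 35, 44],
    'length_of_service': [17, 18, 2, 9, 12],
})

# Keep only the most recent record for each employee
latest = (snapshots.sort_values('STATUS_YEAR')
                   .drop_duplicates('EmployeeID', keep='last'))

# Pearson correlation between the two columns (a value between -1 and 1)
r = latest['age'].corr(latest['length_of_service'])

print(len(latest))
print(round(r, 3))
```

On the real data, the same two steps could be applied to employee_attrition in place of the toy frame.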