In [1]:
# Make sure pandas is loaded
import pandas as pd

# Read in the survey CSV
surveys_df = pd.read_csv("data/raw/surveys.csv")

# Indexing and Slicing in Python

## Selecting data using Labels (Column Headings)
We use square brackets `[]` to select a subset of a Python object. For example, we can select all of the data in the column named `species_id` from the `surveys_df` DataFrame by name. There are two ways to do this:

In [3]:
# TIP: use the .head() method we saw earlier to make output shorter
# Method 1: select a 'subset' of the data using the column name
surveys_df['species_id']

# Method 2: use the column name as an 'attribute'; gives the same output
surveys_df.species_id.head()

0    NL
1    NL
2    DM
3    DM
4    DM
Name: species_id, dtype: object

We can also pass a list of column names as an index; the columns are returned in the order given in the list. This is useful when we need to reorganize our data.

**NOTE:** If a column name is not contained in the DataFrame, an exception (a `KeyError`) will be raised.

In [6]:
# Select the species and plot columns from the DataFrame
surveys_df[['species_id', 'plot_id']].head()

# What happens when you flip the order?
#surveys_df[['plot_id', 'species_id']]

# What happens if you ask for a column that doesn't exist?
#surveys_df['speciess']

  species_id  plot_id
0         NL        2
1         NL        3
2         DM        2
3         DM        7
4         DM        3
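
The two commented-out questions above can be tried safely on a small stand-in. This is only a sketch: `toy_df` is a hypothetical three-row DataFrame (its values echo the first rows shown above), not the real survey file.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df (assumption: three rows are enough to illustrate)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "plot_id": [2, 3, 2],
    "species_id": ["NL", "NL", "DM"],
})

# A list of names returns a DataFrame whose columns follow the order of the list
subset = toy_df[["species_id", "plot_id"]]
flipped = toy_df[["plot_id", "species_id"]]

# Asking for a name that is not a column raises a KeyError
try:
    toy_df["speciess"]
    missing_raised = False
except KeyError:
    missing_raised = True

print(list(subset.columns))
print(list(flipped.columns))
print(missing_raised)
```

Flipping the list flips the column order in the result; the underlying DataFrame is left unchanged.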


## Extracting Range Based Subsets: Slicing

Let’s remind ourselves that Python uses 0-based indexing. This means that the first element in an object is located at position 0.

This is different from other tools like R and Matlab that index elements within objects starting at 1.

In [8]:
# Create a list of numbers:
a = [1, 2, 3, 4, 5]

In [9]:
#Return the first value
a[0]

1

In [12]:
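# len(a) returns the number of elements in the list (5)
len(a)

# Doesn't work: index 5 is out of range, so this raises an IndexError
#a[len(a)]

# Works: the last element is at index len(a) - 1
a[len(a) - 1]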

5

## Slicing Subsets of Rows

Slicing using the `[]` operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the following syntax: `data[start:stop]`. When slicing in pandas, the start bound is included in the output, while the stop bound is excluded: it is one step beyond the last row you want to select. So if you want to select rows 0, 1 and 2 your code would look like this:

In [13]:
# Select rows 0, 1, 2 (row 3 is not selected)
surveys_df[0:3]

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN


The stop bound in Python is different from what you might be used to in languages like Matlab and R, where the stop bound of a range is included in the result.


In [14]:
# Select the first 5 rows (rows 0, 1, 2, 3, 4)
surveys_df[:5]

# Select the last row of the DataFrame
# (the slice starts at the last row, and ends at the end of the DataFrame)
surveys_df[-1:]

       record_id  month  day  year  plot_id species_id  sex  hindfoot_length  weight
35548      35549     12   31  2002        5        NaN  NaN              NaN     NaN
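
To make the start and stop rules concrete, here is a minimal sketch on a hypothetical five-row DataFrame (`toy_df` and its `value` column are made up for illustration). With the default integer index, the number of rows returned by `data[start:stop]` is `stop - start`.

```python
import pandas as pd

# Hypothetical five-row DataFrame with the default 0..4 index
toy_df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

first_three = toy_df[0:3]   # rows at positions 0, 1, 2 (position 3 is excluded)
first_two = toy_df[:2]      # an omitted start means "from the beginning"
last_row = toy_df[-1:]      # a one-row DataFrame, not a single value

print(len(first_three))     # 3
print(len(first_two))       # 2
print(len(last_row))        # 1
```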


## Copying Vs. Referencing

Example:

In [15]:
# Using the 'copy()' method
true_copy_surveys_df = surveys_df.copy()

# Using the '=' operator
ref_surveys_df = surveys_df

You might think that the code ref_surveys_df = surveys_df creates a fresh distinct copy of the surveys_df DataFrame object. However, using the = operator in the simple statement y = x does not create a copy of our DataFrame. Instead, y = x creates a new variable y that references the same object that x refers to. To state this another way, there is only one object (the DataFrame), and both x and y refer to it.

In contrast, the copy() method for a DataFrame creates a true copy of the DataFrame.

Let’s look at what happens when we reassign the values within a subset of the DataFrame that references another DataFrame object:

In [16]:
# Assign the value `0` to the first three rows of data in the DataFrame
ref_surveys_df[0:3] = 0

In [18]:
# ref_surveys_df was created using the '=' operator
#ref_surveys_df.head()

# surveys_df is the original dataframe
#surveys_df.head()

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          0      0    0     0        0          0   0              0.0     0.0
1          0      0    0     0        0          0   0              0.0     0.0
2          0      0    0     0        0          0   0              0.0     0.0
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN


What is the difference between these two dataframes?

When we assigned the value 0 to the first 3 rows using the `ref_surveys_df` DataFrame, the `surveys_df` DataFrame was modified too. Remember that we created the reference `ref_surveys_df` above with `ref_surveys_df = surveys_df`, so `surveys_df` and `ref_surveys_df` refer to the exact same DataFrame object. If the object is changed through either name, the change is visible through the other name as well.
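
One way to check this for yourself is Python's `is` operator, which tests whether two names refer to the same object. The sketch below uses a small hypothetical DataFrame (`original` and its `weight` values are made up) rather than the survey data.

```python
import pandas as pd

# Hypothetical data, for illustration only
original = pd.DataFrame({"weight": [40.0, 48.0, 29.0]})

reference = original            # a second name for the same object
independent = original.copy()   # a separate object with the same contents

# Change one value through the reference
reference.loc[0, "weight"] = 0.0

print(reference is original)            # True: one object, two names
print(independent is original)          # False: copy() made a new object
print(original.loc[0, "weight"])        # 0.0: the change shows up in the original
print(independent.loc[0, "weight"])     # 40.0: the copy is unaffected
```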

To review and recap:



In [20]:
# Copy uses the dataframe’s copy() method
true_copy_surveys_df = surveys_df.copy()

In [21]:
# A Reference is created using the = operator
ref_surveys_df = surveys_df

In [22]:
# Create a brand new clean dataframe from the original csv file
surveys_df = pd.read_csv("data/raw/surveys.csv")

## Slicing Subsets of Rows and Columns in Python

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

- `loc` is primarily label-based indexing. Integers may be used, but they are interpreted as labels.
- `iloc` is primarily integer-based (positional) indexing.

To select a subset of rows and columns from our DataFrame, we can use the `iloc` indexer. For example, we can select month, day and year (columns 2, 3 and 4 if we start counting at 1), like this:

In [23]:
# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]

   month  day  year
0      7   16  1977
1      7   16  1977
2      7   16  1977


Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you ask for 0:3, you are actually telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.
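
The same half-open rule applies to the column slice. As a rough check, the sketch below builds a hypothetical four-row DataFrame with the first five survey column names (the values are illustrative) and inspects the shape and column names of the result.

```python
import pandas as pd

# Hypothetical stand-in using the first five column names of the survey data
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "month": [7, 7, 7, 7],
    "day": [16, 16, 16, 16],
    "year": [1977, 1977, 1977, 1977],
    "plot_id": [2, 3, 2, 7],
})

# Rows at positions 0, 1, 2 and columns at positions 1, 2, 3
block = toy_df.iloc[0:3, 1:4]

print(block.shape)           # (3, 3)
print(list(block.columns))   # ['month', 'day', 'year']
```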

Let’s explore some other ways to index and select subsets of data:

In [28]:
# Select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]

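# What does this do?
#surveys_df.loc[0, ['species_id', 'plot_id', 'weight']]

# Returns those three columns for the row labelled 0 as a Series;
# the integer 0 is interpreted as an index label, not a position

# What happens when you type the code below?
# surveys_df.loc[[0, 10, 35549], :]

# Raises a KeyError in current pandas versions:
# the label 35549 is not in the index (the last label is 35548)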

    record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0           1      7   16  1977        2         NL   M             32.0     NaN
10         11      7   16  1977        5         DS   F             53.0     NaN


**NOTE:** Labels must be found in the DataFrame or you will get a KeyError.

Indexing by labels with `loc` differs from indexing by integers with `iloc`. With `loc`, both the start bound and the stop bound are inclusive. Integers can be used with `loc`, but they refer to the index label and not the position. For example, using `loc` to select 1:4 returns the rows labelled 1 through 4 (four rows), whereas using `iloc` to select 1:4 returns the rows at positions 1, 2 and 3 (three rows).
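
The difference is easy to see side by side. This is a minimal sketch on a hypothetical six-row DataFrame with the default integer index; `toy_df` and its values are illustrative, not taken from the survey file.

```python
import pandas as pd

# Hypothetical six-row DataFrame; the default index labels are 0..5
toy_df = pd.DataFrame({"species_id": ["NL", "NL", "DM", "DM", "DM", "PF"]})

by_label = toy_df.loc[1:4]       # labels 1, 2, 3, 4 (stop bound included)
by_position = toy_df.iloc[1:4]   # positions 1, 2, 3 (stop bound excluded)

print(list(by_label.index))      # [1, 2, 3, 4]
print(list(by_position.index))   # [1, 2, 3]
```

Here labels and positions happen to coincide because the index is the default 0, 1, 2, ... sequence; with a different index, `loc` would still look up labels while `iloc` would still count positions.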

We can also select a specific data value using a row and column location within the DataFrame and iloc indexing: