# Intro

- Indexing, slicing, subsetting dataframes
- Combining Dataframes with pandas
- Data Workflows and Automation
- Plotting data


# Indexing, slicing, subsetting dataframes
**45 min**
13:00 – 13:45

In this lesson, we will explore ways to access different parts of the data using:

- indexing,
- slicing, and
- subsetting.

We will continue to use the surveys dataset that we worked with in the last episode. Let’s reopen and read in the data again:

In [None]:
# Make sure pandas is loaded
import pandas as pd

# Read in the survey CSV
surveys_df = pd.read_csv("surveys.csv")

## Basic math functions on a column

Quick note: arithmetic on a column is applied to every value in that column.

In [None]:
# Multiply all weight values by 2
surveys_df['weight']*2
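
To see what this kind of column arithmetic does without loading the full dataset, here is a minimal sketch. It assumes a small made-up dataframe (`toy_df` is hypothetical, not part of the surveys data) with one missing weight:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the surveys data
toy_df = pd.DataFrame({"weight": [40.0, None, 48.0]})

# The multiplication is applied element by element and returns a new Series
doubled = toy_df["weight"] * 2

print(doubled.tolist())           # [80.0, nan, 96.0]: the missing value stays missing
print(toy_df["weight"].tolist())  # [40.0, nan, 48.0]: the original column is unchanged
```

The result is only kept if we assign it to a variable or to a column.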

We often want to work with subsets of a DataFrame object. 

There are different ways to accomplish this including:

- selecting columns
- slicing rows
- selecting data based on location in dataframe (rows and columns)
- selecting data based on values

## Selecting columns using labels (column names)

We use square brackets to select a subset of a Python object. This includes `pandas` dataframes. 

Let's first see which column we want to select:

In [None]:
surveys_df.columns

Now that we know our column names, we can select all data from a column named `species_id` in the `surveys_df` DataFrame by using the column name.

There are two ways to do this:

In [None]:
# TIP: use the .head() method we saw earlier to make output shorter
# Method 1: select a 'subset' of the data using the column name
# this is the square bracket notation
surveys_df['species_id']

In [None]:
# Method 2: use the column name as an 'attribute'; gives the same output
surveys_df.species_id

We can assign our selection to a variable and also take a look at what type of object this is:

In [None]:
surveys_species = surveys_df['species_id']
type(surveys_species)

It's a `Series`, which holds the index from the original dataframe and the values from the `species_id` column at each index.

Take note that this index has numeric values and they start at '0'.


We can also select multiple columns. What's different about the notation here? Why?

In [None]:
surveys_df[['species_id', 'plot_id']]

What happens if you don't use the 'extra' square brackets?

In [None]:
surveys_df['species_id', 'plot_id']
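
The difference between the two notations can be checked on a small made-up dataframe. This is a sketch under the assumption that `toy_df` (a hypothetical name) has the same kind of columns as the surveys data:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the surveys data
toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", "DM"],
    "plot_id": [2, 3, 2],
    "weight": [40.0, 48.0, 29.0],
})

one_column = toy_df["species_id"]                # a string selects one column
two_columns = toy_df[["species_id", "plot_id"]]  # a list selects several columns

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame

# Without the inner list, Python passes the tuple ('species_id', 'plot_id')
# as a single key, and pandas looks for one column with that name.
try:
    toy_df["species_id", "plot_id"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(raised_key_error)  # True
```

The outer brackets mean "select from the dataframe"; the inner brackets build the list of column names.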

## A note about indexes

The positional index on an object, including a dataframe or series, is numeric and starts at '0'.

You can get a value, or a range of values, by specifying their index. Let's first see how this looks on a simpler, one-dimensional, object like a list:

In [None]:
a = [1, 2, 3, 4, 5]

Now, if I want to get the second element in `a`, I can use its index:

In [None]:
a[1]

Why did specifying an index of '1' give me the second value?

This is because our indexes start at '0'.

If I want to get a range of values based on their index, I can use `:`. This is called "slicing":

In [None]:
a[:]

Leaving this empty returns a slice with all values.

In [None]:
a[1:1]

What do you think is happening here? Why didn't we get a value?

Slicing by index is exclusive of the last index specified:

In [None]:
a[1:2]
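
A few more slices of the same list, as a short sketch of the start-inclusive, stop-exclusive rule (the variable names are only for illustration):

```python
a = [1, 2, 3, 4, 5]

first_two = a[:2]   # start omitted: begins at index 0, stops before index 2
middle = a[1:4]     # indexes 1, 2 and 3
last_two = a[-2:]   # negative indexes count from the end
empty = a[1:1]      # start equals stop, so nothing is selected

print(first_two)  # [1, 2]
print(middle)     # [2, 3, 4]
print(last_two)   # [4, 5]
print(empty)      # []
```

A handy consequence of the exclusive stop: the length of `a[i:j]` is `j - i` whenever both indexes are inside the list.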

## Slicing rows using positional indexes 
Bringing this positional indexing knowledge back to our dataframe, we can actually slice out rows by also using bracket notation:

In [None]:
surveys_df[0:3]

Now if we think about it, that's a bit strange.
A moment ago, our notation for getting a column was pretty much the same:

    surveys_df['column_name']

Now we're using:

    surveys_df[#:#]

And suddenly we're slicing rows instead of getting columns. The bracket notation behaves differently depending on what we put inside it: a column name selects a column, while a slice selects rows.

This all gets more confusing once you know that:

- You can change the "index" label referring to rows in a dataframe to text instead of numbers, or anything you want; 
- and the columns not only have names/labels, but also a positional index of their own.

So, it's actually better to be much more explicit about what you're grabbing from a dataframe and how you're specifying its location.

This can be by positional index or by a name/label.
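
The two points above can be illustrated with a small sketch. It assumes a made-up dataframe (`toy_df` and `relabelled` are hypothetical names) whose row labels are replaced by text:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the surveys data
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "DM", "PF"],
    "weight": [40.0, 48.0, 7.0],
})

# Row labels do not have to be 0, 1, 2, ...: here they become text.
relabelled = toy_df.set_index("species_id")
print(relabelled.index.tolist())            # ['NL', 'DM', 'PF']

# Columns have names and positions.
print(list(enumerate(relabelled.columns)))  # [(0, 'record_id'), (1, 'weight')]

# A bare integer slice still counts rows by position, whatever their labels are.
first_two = relabelled[0:2]
print(first_two.index.tolist())             # ['NL', 'DM']
```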

## Selecting data based on location

### Use DataFrame.iloc[..., ...] to select values by their (entry) position
We can specify the location of a value that we want to get from a dataframe using only positional (numeric) indexes.

This is analogous to the index selection in our list, but now we're 2-dimensional:

In [None]:
#surveys_df.iloc[row_index, column_index] < this is the notation

surveys_df.iloc[0, 5]

This time, we're getting a single value at the vertical position of '0' and the horizontal position of '5'. We're referring to the column with a positional index of '5'; the name of this column is `species_id`.

If we want a slice, we can use the same ':' notation:

In [None]:
#surveys_df.iloc[row_index:row_index, column_index:column_index] < this is the notation
surveys_df.iloc[0:3, 1:4]
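
Here is the same idea on a dataframe small enough to check by eye. This is a sketch that assumes a made-up `toy_df` (hypothetical, with fewer columns than the surveys data, so the column positions differ):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the surveys data
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "plot_id": [2, 3, 2, 7],
    "species_id": ["NL", "NL", "DM", "DM"],
    "sex": ["M", "M", "F", "M"],
})

single_value = toy_df.iloc[0, 2]     # row position 0, column position 2
block = toy_df.iloc[0:3, 1:3]        # rows 0-2 and columns 1-2 (stop is excluded)
last_two_rows = toy_df.iloc[-2:, :]  # negative positions count from the end

print(single_value)                         # NL
print(block.shape)                          # (3, 2)
print(list(block.columns))                  # ['plot_id', 'species_id']
print(last_two_rows["record_id"].tolist())  # [3, 4]
```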

**Challenge**: Use `.iloc` to return the values for the last five rows of the `species_id` and `sex` columns.

In [None]:
# Challenge solutions
# Count the rows first, then slice the last five by position
surveys_df.record_id.count()
surveys_df.iloc[35544:35549, 5:7]

# or

surveys_df.iloc[-5:, 5:7]

### Use DataFrame.loc[..., ...] to select values by their (entry) label

Remember that a DataFrame provides an index as a way to identify the rows of the table. A row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

We can specify a location by row and column labels, much like a two-dimensional version of dictionary keys.


In [None]:
#surveys_df.loc[row_label, column_label] < this is the notation
surveys_df.loc[0, 'species_id']

Now, if we're selecting values by their label, why are we still using '0' to identify the row?

*(Draw answer)*

We can also pull out slices:

In [None]:
#surveys_df.loc[row_label:row_label, column_label:column_label] < this is the notation
surveys_df.loc[0:3, 'species_id':'hindfoot_length']

**Note**: When we're using the labels instead of the positional index, the end of the range for the slice is inclusive!
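
The difference in end points is easy to miss, so here is a sketch comparing the two on a made-up dataframe (`toy_df` is hypothetical and keeps the default integer row labels):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the surveys data
toy_df = pd.DataFrame({
    "species_id": ["NL", "NL", "DM", "DM", "PF"],
    "sex": ["M", "M", "F", "M", "F"],
    "hindfoot_length": [32.0, 33.0, 37.0, 36.0, 35.0],
})

# Label-based: row label 3 and column 'sex' are both included
by_label = toy_df.loc[0:3, "species_id":"sex"]

# Position-based: row position 3 and column position 2 are both excluded
by_position = toy_df.iloc[0:3, 0:2]

print(by_label.shape)     # (4, 2)
print(by_position.shape)  # (3, 2)
```

With the default index the row labels happen to be the integers 0, 1, 2, ..., which is why `.loc[0:3]` looks like a positional slice even though it is matching labels.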

### Result of slicing can be used in further operations

In [None]:
# numeric_only=True skips the text columns (species_id, sex), which have no mean
surveys_df.loc[0:3, 'species_id':'hindfoot_length'].mean(numeric_only=True)

## Subsetting Data using Criteria

Now we'll look at how you can subset data based on criteria. This is done using some pretty simple syntax, and it's another use of square bracket notation.

### Use comparisons to select data based on value

In [None]:
surveys_df[surveys_df.year == 2002]

This will give us all rows where the year is 2002. Why do we use two equals signs instead of one?

Or we can select all rows where the year is not 2002:

In [None]:
surveys_df[surveys_df.year != 2002]

We can also combine sets of criteria:

In [None]:
surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]

# & for and
# | for or
# surveys_df[(surveys_df.year == 1980) | (surveys_df.year == 1985)]

- Equals: `==`
- Not equals: `!=`
- Greater than, less than: `>` or `<`
- Greater than or equal to: `>=`
- Less than or equal to: `<=`
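
These operators can be combined with `&` and `|`, as in this sketch on a made-up dataframe (`toy_df` is hypothetical). Each comparison sits in its own parentheses because `&` and `|` bind more tightly than comparison operators in Python:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the surveys data
toy_df = pd.DataFrame({
    "year": [1977, 1980, 1983, 1985, 2002],
    "sex": ["M", "F", "M", "F", "M"],
})

# Both conditions must hold
in_range = toy_df[(toy_df.year >= 1980) & (toy_df.year <= 1985)]

# At least one condition must hold
either_end = toy_df[(toy_df.year == 1980) | (toy_df.year == 1985)]

# Conditions on different columns can be chained in the same way
males_in_range = toy_df[
    (toy_df.year >= 1980) & (toy_df.year <= 1985) & (toy_df.sex == "M")
]

print(in_range.year.tolist())        # [1980, 1983, 1985]
print(either_end.year.tolist())      # [1980, 1985]
print(males_in_range.year.tolist())  # [1983]
```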


### About using Boolean masks

Now let's talk about what's going on when we do

`surveys_df[surveys_df.year == 2002]`

What do we get if we just type:

In [None]:
surveys_df.year == 2002

What we get is a series that pairs the index of each row in `surveys_df` with a boolean value of True/False, depending on whether the year column contains 2002 in that particular row.

Now, we can save this to a variable and apply it to our dataframe:

In [None]:
mask = surveys_df.year == 2002

In [None]:
surveys_df[mask]

This is called using a "boolean mask" and will return the rows in your dataframe where your mask gives a True value, on the same row index.
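
Because a mask is just a series of booleans, we can inspect it, count it, or flip it before using it. A minimal sketch, assuming a made-up dataframe (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the surveys data
toy_df = pd.DataFrame({
    "year": [2001, 2002, 2002],
    "weight": [40.0, 48.0, 29.0],
})

mask = toy_df.year == 2002

print(mask.tolist())     # [False, True, True]
print(int(mask.sum()))   # 2, because True counts as 1

selected = toy_df[mask]       # rows where the mask is True
not_selected = toy_df[~mask]  # ~ flips every True/False in the mask

print(selected.index.tolist())      # [1, 2]
print(not_selected.index.tolist())  # [0]
```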

`pandas` also has a neat function that will tell us whether or not values in a dataframe are missing. Let's find out more about it:

In [None]:
pd.isna?

This indicates whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN`` in object arrays, ``NaT`` in datetimelike).

Applied to a dataframe, it returns a dataframe of the same shape filled with booleans!

In [None]:
pd.isna(surveys_df)

To select the rows where there are null values in **any** column, we can use the mask as an index to subset our data as follows:

In [None]:
# To select just the rows with NaN values in any column, we can use the 'any()' method
# axis=1 checks across the columns of each row, giving one True/False per row

surveys_df[pd.isna(surveys_df).any(axis=1)]
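
The `axis` argument is easy to get backwards, so here is a sketch on a made-up dataframe (`toy_df` is hypothetical) that shows both directions:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the surveys data
toy_df = pd.DataFrame({
    "plot_id": [2, 3, 2, 7],
    "sex": ["M", None, "F", "M"],
    "weight": [40.0, 48.0, None, 36.0],
})

missing = pd.isna(toy_df)  # same shape as toy_df, filled with booleans

rows_with_gap = missing.any(axis=1)  # one value per row
cols_with_gap = missing.any(axis=0)  # one value per column

print(rows_with_gap.tolist())  # [False, True, True, False]
print(cols_with_gap.tolist())  # [False, True, True]

incomplete = toy_df[rows_with_gap]  # rows with at least one missing value
complete = toy_df[~rows_with_gap]   # rows with no missing values

print(incomplete.index.tolist())  # [1, 2]
print(complete.index.tolist())    # [0, 3]
```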

**Optional Challenge**: Create a new DataFrame that only contains observations with sex values that are not female or male. 

Assign each sex value in the new DataFrame to a new value of ‘x’. 

Determine the number of null values in the subset.

In [None]:
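# Select rows where sex is neither 'M' nor 'F'; .copy() gives an independent
# DataFrame, so the assignment below does not raise a SettingWithCopyWarning
new_surveys = surveys_df[(surveys_df.sex != 'M') & (surveys_df.sex != 'F')].copy()
new_surveys['sex'] = 'x'
new_surveys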

In [None]:
# Count rows with at least one null: .sum() counts True values (.count() counts all rows)
pd.isna(new_surveys).any(axis=1).sum()
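
When counting with a boolean series, note that `.count()` and `.sum()` answer different questions. A short sketch with a made-up series (`flags` is a hypothetical name):

```python
import pandas as pd

# Hypothetical boolean series, e.g. the result of .any(axis=1)
flags = pd.Series([True, False, True, False, False])

n_entries = int(flags.count())  # number of non-missing entries, True or False
n_true = int(flags.sum())       # number of True entries, since True counts as 1

print(n_entries)  # 5
print(n_true)     # 2
```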

## Key points
- In Python, portions of data can be accessed using indices, slices, column headings, and condition-based subsetting.
- Python uses 0-based indexing, in which the first element in a list, tuple or any other data structure has an index of 0.
- Pandas enables common data exploration steps such as data indexing, slicing and conditional subsetting.