# Subsetting the data

## About the Data
In this notebook, we will be working with earthquake data from September 18, 2018 through October 13, 2018, obtained from the US Geological Survey (USGS) using the [USGS API](https://earthquake.usgs.gov/fdsnws/event/1/).

## Setup
We will be working with the `data/earthquakes.csv` file again, so we need to handle our imports and read it in.

In [None]:
import pandas as pd

df = pd.read_csv('data/earthquakes.csv')

## Selecting columns
Grab an entire column using attribute notation:

In [None]:
df.mag

Grab an entire column using dictionary syntax:

In [None]:
df['mag']

Selecting multiple columns:

In [None]:
df[['mag', 'title']]
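
The type of the key decides what comes back: a single string returns a `Series`, while a list of names returns a `DataFrame`, even when the list holds only one column. The following is a minimal sketch on a made-up dataframe (`sample` and its values are hypothetical stand-ins, not rows from `earthquakes.csv`):

```python
import pandas as pd

# Made-up stand-in for a few rows of the earthquake data
sample = pd.DataFrame({
    'mag': [1.35, 1.29, 3.42],
    'title': ['M 1.4 - quake A', 'M 1.3 - quake B', 'M 3.4 - quake C'],
})

one_column = sample['mag']       # string key: returns a Series
one_column_df = sample[['mag']]  # list key: returns a DataFrame with one column

print(type(one_column).__name__)     # Series
print(type(one_column_df).__name__)  # DataFrame
```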

Selecting columns using list comprehensions and string operations:

In [None]:
df[
    ['title', 'time']
    + [col for col in df.columns if col.startswith('mag')]
]

Breaking down this example:
1. the list comprehension

In [None]:
[col for col in df.columns if col.startswith('mag')]

2. assembling the list

In [None]:
['title', 'time'] \
+ [col for col in df.columns if col.startswith('mag')]

3. using this list as the list of columns

In [None]:
df[
    ['title', 'time']
    + [col for col in df.columns if col.startswith('mag')]
]

## Slicing
### Selecting rows
Using row numbers (inclusive of first index, exclusive of last):

In [None]:
df[100:103]

### Selecting rows and columns with chaining

In [None]:
df[['title', 'time']][100:103]

Order doesn't matter here:

In [None]:
df[100:103][['title', 'time']].equals(
    df[['title', 'time']][100:103]
)

So we know how to select rows and columns, but can we update values? Well, if we try using what we have learned so far, we will see the following warning:

In [None]:
df[110:113]['title'] = df[110:113]['title'].str.lower()

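Note that the assignment worked here, but `pandas` warns that we were setting a value on a copy of a slice and that we should use `loc` instead (the topic of the following section). Whether a chained assignment like this one reaches the original dataframe is not guaranteed and can differ between `pandas` versions, so we shouldn't rely on it: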

In [None]:
df[110:113]['title']

## Indexing

Now if we do this with `loc` as the warning suggests, everything goes smoothly. Note we have to lower the end index by one since `loc` is inclusive of endpoints:

In [None]:
df.loc[110:112, 'title'] = df.loc[110:112, 'title'].str.lower()
df.loc[110:112, 'title']
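
To see the same update pattern at a size we can check by eye, here is a small sketch on a made-up dataframe (the `sample` data is hypothetical). Because the rows and the column are selected in a single `loc` call, the assignment is applied to the dataframe itself rather than to an intermediate object:

```python
import pandas as pd

# Made-up titles standing in for the earthquake data
sample = pd.DataFrame({
    'title': ['M 1.4 - ALPHA', 'M 1.3 - BRAVO', 'M 3.4 - CHARLIE', 'M 0.4 - DELTA']
})

# Rows labeled 1 through 2 (both endpoints included) are updated in one step
sample.loc[1:2, 'title'] = sample.loc[1:2, 'title'].str.lower()

print(sample.title.tolist())
# ['M 1.4 - ALPHA', 'm 1.3 - bravo', 'm 3.4 - charlie', 'M 0.4 - DELTA']
```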

### Indexing with `loc`
Selection of the format `loc[row_indexer, column_indexer]` where `:` can be used to select all:

In [None]:
df.loc[:,'title']

We can use `loc` to select specific rows and columns without chaining. If we use row numbers with `loc`, they are now **inclusive** of the end index:

In [None]:
df.loc[10:15, ['title', 'mag']]

### Indexing with `iloc`
`iloc` selects by integer position and, just like Python slicing, is exclusive of the endpoint:

In [None]:
df.iloc[10:15, [19, 8]]

We can use slicing syntax with `iloc` for both rows and columns:

In [None]:
df.iloc[10:15, 6:10]

When using `loc`, we can slice on column names. This will be inclusive of the endpoint because you can't be expected to know what the next column name will be. As such, we have multiple ways to achieve the same end goal:

In [None]:
df.iloc[10:15, 6:10].equals(
    df.loc[10:14, 'gap':'magType']
)
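
The difference between label-based and position-based slicing is easier to see when the index isn't the default range of integers. The sketch below uses a made-up dataframe with string row labels (the `sample` data is hypothetical) to select the same block both ways:

```python
import pandas as pd

# Made-up data with string row labels so labels and positions can't be confused
sample = pd.DataFrame(
    {
        'gap': [10, 20, 30, 40],
        'mag': [1.1, 2.2, 3.3, 4.4],
        'magType': ['ml', 'md', 'ml', 'mw'],
    },
    index=['a', 'b', 'c', 'd'],
)

by_label = sample.loc['b':'c', 'gap':'mag']  # labels: both endpoints included
by_position = sample.iloc[1:3, 0:2]          # positions: end excluded

print(by_label.equals(by_position))  # True
```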

### Looking up scalar values
We used `loc` and `iloc` to grab subsets of the dataframe. However, if we are just interested in the specific value at a given `[row, column]`, then we can use `iat` and `at`. We use `at` with labels:

In [None]:
df.at[10, 'mag']

...and `iat` with integer indices:

In [None]:
df.iat[10, 8]
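
Hardcoding a column position such as `8` breaks if the column order changes. One option, sketched here on a made-up dataframe (the `sample` data is hypothetical), is to look up the position with `columns.get_loc()` and pass it to `iat`:

```python
import pandas as pd

# Made-up stand-in for a few columns of the earthquake data
sample = pd.DataFrame({
    'gap': [10, 20, 30],
    'mag': [1.1, 2.2, 3.3],
    'magType': ['ml', 'md', 'mw'],
})

mag_position = sample.columns.get_loc('mag')  # integer position of the 'mag' column

by_label = sample.at[1, 'mag']                 # row label 1, column label 'mag'
by_position = sample.iat[1, mag_position]      # row position 1, column position 1

print(mag_position, by_label, by_position)  # 1 2.2 2.2
```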

## Filtering
We can filter our dataframes using a **Boolean mask**, which can be made as follows:

In [None]:
df.mag > 2

To use a mask for selection, we simply place it inside the brackets:

In [None]:
df[df.mag >= 7.0]

We can use masks with `loc`:

In [None]:
df.loc[
    df.mag >= 7.0,
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]

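Masks can be created using multiple criteria combined with the bitwise operators `&` (AND) and `|` (OR). We must surround each criterion with parentheses because `&` and `|` bind more tightly than comparison operators such as `==`. We can't use `and`/`or` here because they try to reduce each `Series` to a single truth value, whereas we need the criteria evaluated row by row: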

In [None]:
df.loc[
    (df.tsunami == 1) & (df.alert == 'red'),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]
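
To make the `and` versus `&` distinction concrete, the sketch below runs both on a made-up dataframe (the `sample` data is hypothetical). The `and` version raises a `ValueError` because a whole `Series` has no single truth value, while `&` compares element by element:

```python
import pandas as pd

# Made-up stand-in for the tsunami and alert columns
sample = pd.DataFrame({
    'tsunami': [1, 0, 1],
    'alert': ['red', 'red', 'green'],
})

and_error = None
try:
    # `and` asks for the truth value of the entire Series, which is ambiguous
    mask = (sample.tsunami == 1) and (sample.alert == 'red')
except ValueError as error:
    and_error = str(error)
    print(and_error)

# `&` combines the two Boolean Series one row at a time
mask = (sample.tsunami == 1) & (sample.alert == 'red')
print(mask.tolist())  # [True, False, False]
```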

An example with an OR condition, which is less restrictive:

In [None]:
df.loc[
    (df.tsunami == 1) | (df.alert == 'red'),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]

Masks can be created from any criterion that results in a Boolean. For example, we can select all earthquakes with the string `Alaska` in the `place` column and a non-null value in the `alert` column. To get non-nulls, we can use the `isnull()` method with the bitwise negation operator (`~`) or the `notnull()` method:

In [None]:
df.loc[
    (df.place.str.contains('Alaska')) & (df.alert.notnull()),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]
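
As a quick check that the two ways of finding non-nulls agree, here is a small sketch on a made-up dataframe (the `sample` data is hypothetical):

```python
import pandas as pd

# Made-up rows; the second one has no alert level
sample = pd.DataFrame({
    'place': ['Alaska', 'Southern Alaska', 'Nevada'],
    'alert': ['green', None, 'green'],
})

negated = ~sample.alert.isnull()  # negate the "is null" mask
direct = sample.alert.notnull()   # ask for non-nulls directly
print(negated.equals(direct))     # True

# Combine with a string check, as in the cell above
mask = sample.place.str.contains('Alaska') & sample.alert.notnull()
print(mask.tolist())  # [True, False, False]
```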

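We can even use regular expressions here. The pattern below matches places that contain `CA` anywhere or that end in `California` (the `$` anchor applies only to the second alternative):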

In [None]:
df.loc[
    (df.place.str.contains(r'CA|California$')) & (df.mag > 3.8),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]
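
To see how the alternation in `r'CA|California$'` is evaluated, the sketch below applies it to a handful of made-up place strings (these values are hypothetical and chosen only to exercise each branch of the pattern):

```python
import pandas as pd

# Made-up place names, one per case of interest
places = pd.Series([
    '10km NE of Aguanga, CA',    # contains 'CA'
    'Northern California',       # ends in 'California'
    'California-Nevada border',  # 'California' is not at the end, and there is no 'CA'
    'CARIBBEAN SEA',             # contains 'CA', since that branch is unanchored
])

matches = places.str.contains(r'CA|California$')
print(matches.tolist())  # [True, True, False, True]
```

The last row shows that `CA` can match anywhere in the string, so a stricter pattern would be needed if only the state abbreviation at the end of the string should count.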

We can use the `between()` method to turn two individual checks (greater than or equal to some minimum value and less than or equal to some maximum value) into a single one. Note that this is inclusive of both endpoints by default:

In [None]:
df.loc[
    df.mag.between(6.5, 7.5),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]
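
The equivalence between `between()` and the two separate comparisons can be checked on a small made-up `Series` (the `mags` values are hypothetical and include both endpoints on purpose):

```python
import pandas as pd

# Made-up magnitudes straddling the 6.5 and 7.5 boundaries
mags = pd.Series([6.4, 6.5, 7.0, 7.5, 7.6])

with_between = mags.between(6.5, 7.5)
with_two_checks = (mags >= 6.5) & (mags <= 7.5)

print(with_between.tolist())                 # [False, True, True, True, False]
print(with_between.equals(with_two_checks))  # True
```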

We can use the `isin()` method to check for membership in a list of values:

In [None]:
df.loc[
    df.magType.isin(['mw', 'mwb']),
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]
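
`isin()` behaves like a chain of equality checks joined with `|`, and the resulting mask can be negated with `~` to exclude values instead. A minimal sketch on a made-up `Series` (the `mag_types` values are hypothetical):

```python
import pandas as pd

# Made-up magnitude types
mag_types = pd.Series(['ml', 'mw', 'md', 'mwb'])

in_list = mag_types.isin(['mw', 'mwb'])
with_or = (mag_types == 'mw') | (mag_types == 'mwb')

print(in_list.tolist())         # [False, True, False, True]
print(in_list.equals(with_or))  # True
print((~in_list).tolist())      # [True, False, True, False]
```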

We can grab the index of the minimum and maximum values of a given column and use those to select the entire row where they occur:

In [None]:
[df.mag.idxmin(), df.mag.idxmax()]

In [None]:
df.loc[
    [df.mag.idxmin(), df.mag.idxmax()],
    ['alert', 'mag', 'magType', 'title', 'tsunami', 'type']
]
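
Note that `idxmin()` and `idxmax()` return index labels rather than integer positions, which is why they pair with `loc`. With the default index the two look identical, so the sketch below uses a made-up dataframe with string labels (the `sample` data is hypothetical):

```python
import pandas as pd

# Made-up magnitudes with string row labels
sample = pd.DataFrame(
    {'mag': [2.5, 7.5, -0.5]},
    index=['quake_a', 'quake_b', 'quake_c'],
)

lowest, highest = sample.mag.idxmin(), sample.mag.idxmax()
print(lowest, highest)  # quake_c quake_b

# The labels go straight into loc to pull the full rows
print(sample.loc[[lowest, highest]])
```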

Note that there is a `filter()` method, but it doesn't filter the data in the same sense as we discussed in this section: it selects rows or columns by their labels, not by the values they contain. Here are a few things you can do with this method.

- grab columns of a dataframe by passing a list to `items`:

In [None]:
df.filter(items=['mag', 'magType']).head()

- grab all the columns that contain a string with the `like` parameter:

In [None]:
df.filter(like='mag').head()

- use regular expressions; here, we select any columns that start with `t`:

In [None]:
df.filter(regex=r'^t').head()

- use `filter()` along the rows, by passing in `axis=0`. Here, we will use the `place` column as the index (we will cover `set_index()` in lab 3):

In [None]:
(
    df.set_index('place')
    .filter(like='Japan', axis=0)
    .filter(items=['mag', 'magType', 'title'])
    .head()
)

This also works on `Series` objects and will run on the index:

In [None]:
df.set_index('place').title.filter(like='Japan').head()
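
To contrast the two approaches, the sketch below selects the same rows once with a Boolean mask on the column values and once with `filter()` on the index labels. The `sample` dataframe is made up for illustration:

```python
import pandas as pd

# Made-up stand-in for the place and mag columns
sample = pd.DataFrame({
    'place': ['Honshu, Japan', 'Alaska', 'Hokkaido, Japan'],
    'mag': [4.5, 1.2, 5.1],
})

# Boolean mask: looks at the values in the place column
by_value = sample[sample.place.str.contains('Japan')]

# filter(): looks at the index labels, so place has to become the index first
by_label = sample.set_index('place').filter(like='Japan', axis=0)

print(by_value.mag.tolist())  # [4.5, 5.1]
print(by_label.mag.tolist())  # [4.5, 5.1]
```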
