# Indexing and Selection

Now that we are familiar with pandas' data structures, we can turn our attention to some of the intermediate features of data frames, which include:
    
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- **Hierarchical labeling** of axes
- **Sorting and ranking** of data in DataFrames
- Easy handling of **missing data**
- Data **summarization** tools

In this section, we will manipulate data collected from ocean-going vessels on the eastern seaboard. Vessel operations are monitored using the **Automatic Identification System (AIS)**, a safety-at-sea navigation technology that vessels are required to maintain. AIS uses transponders to transmit very high frequency (VHF) radio signals containing static information, including ship name, call sign, and country of origin, as well as dynamic information unique to a particular voyage, such as vessel location, heading, and speed.

![AIS](images/ais.gif)

The International Maritime Organization’s (IMO) International Convention for the Safety of Life at Sea requires functioning AIS capabilities on all vessels 300 gross tons or greater, and the U.S. Coast Guard requires AIS on nearly all vessels sailing in U.S. waters. The Coast Guard has established a national network of AIS receivers that provides coverage of nearly all U.S. waters. **AIS signals** are transmitted several times each minute, and the network is capable of handling thousands of reports per minute and updates as often as every two seconds. Therefore, a typical voyage in our study might include the transmission of hundreds or thousands of AIS-encoded signals. This provides a rich source of data that includes both **spatial and temporal information**.

For our purposes, we will use **summarized data** that describes the transit of a given vessel through a particular administrative area. The data includes the start and end time of the transit segment, as well as information about the speed of the vessel, how far it travelled, etc.

In [None]:
import pandas as pd
import numpy as np

vessels = pd.read_csv('../data/AIS/vessel_information.csv', index_col=0)

In [None]:
vessels.shape

## Indexing and Selection

Indexing works analogously to indexing in NumPy arrays, except we can use the labels in the `Index` object to extract values in addition to arrays of integers.

In [None]:
vessels.columns

In [None]:
# Sample Series object
flag = vessels.flag
flag

In [None]:
# NumPy-style positional slicing
flag[:10]

In [None]:
# Indexing by label
flag[[298716,725011300]]
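
When the index holds integers, as the vessel identifiers do here, plain square brackets can be ambiguous between labels and positions. The sketch below uses a small made-up `Series` (the flags and index labels are hypothetical, not taken from the vessel data) to show how `loc` and `iloc` make the intent explicit:

```python
import pandas as pd

# Hypothetical flags keyed by made-up integer identifiers
toy_flag = pd.Series(
    ['Panama', 'Liberia', 'Malta'],
    index=[500, 20, 3000],
    name='flag',
)

# Label-based: look up the rows whose index labels are 20 and 3000
by_label = toy_flag.loc[[20, 3000]]

# Position-based: take the first two rows, whatever their labels are
by_position = toy_flag.iloc[:2]

print(by_label)
print(by_position)
```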

In a `DataFrame` we can slice along either or both axes:

In [None]:
vessels[['num_names','num_types']].head()

In [None]:
vessels[vessels.max_loa > 700]
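
The expression inside the brackets is itself a boolean `Series` aligned to the rows, so several conditions can be combined with `&` and `|` as long as each comparison is parenthesized. A minimal sketch on a hypothetical toy table (the column names are borrowed from the vessel data, the values and identifiers are invented):

```python
import pandas as pd

# Toy stand-in for the vessel table; values are made up for illustration
toy_vessels = pd.DataFrame(
    {'max_loa': [150.0, 720.0, 910.0, 95.0],
     'num_names': [1, 3, 1, 2]},
    index=[101, 102, 103, 104],
)

# A comparison yields one boolean per row
long_mask = toy_vessels.max_loa > 700

# Combine conditions element-wise; parentheses are required around each one
long_single_name = toy_vessels[long_mask & (toy_vessels.num_names == 1)]

print(long_mask)
print(long_single_name)
```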

The `loc` indexer allows us to select subsets of rows and columns by label in an intuitive way:

In [None]:
vessels.loc[720768000, ['names','flag', 'type']]

In [None]:
vessels.loc[:4731, 'names']

Slicing also works with string labels, since an index has an intrinsic order regardless of the label type:

In [None]:
vessels.columns

In [None]:
vessels.loc[:310, 'flag':'loa']

In addition to using `loc` to select rows and columns by **label**, pandas also allows indexing by **position** using the `iloc` attribute.

So, we can query rows and columns by absolute position, rather than by name:

In [None]:
vessels.iloc[:5, 5:8]
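
One difference between the two indexers is easy to miss: a label slice with `loc` includes both endpoints, while a positional slice with `iloc` excludes the stop position, as in plain Python. The following sketch illustrates this on a hypothetical four-row table (all names and values are invented for the example):

```python
import pandas as pd

# Toy table with a sorted integer index and three columns
toy = pd.DataFrame(
    {'flag': ['A', 'B', 'C', 'D'],
     'loa': [10.0, 20.0, 30.0, 40.0],
     'type': ['Tug', 'Cargo', 'Tanker', 'Tug']},
    index=[5, 10, 15, 20],
)

# Label slice: row label 15 and column 'loa' are both included
label_slice = toy.loc[:15, 'flag':'loa']

# Positional slice: row position 3 and column position 1 are excluded
position_slice = toy.iloc[:3, 0:1]

print(label_slice)
print(position_slice)
```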

## Indexing with `where`

Pandas `DataFrame` objects also possess a `where` method that keeps the values that satisfy a condition and replaces the rest with `NaN`. The result retains the index and columns of the original `DataFrame`, so the shape does not change. This is important when **alignment** is required for operations between `DataFrame`s.

In [None]:
np.random.seed(42)
normal_vals = pd.DataFrame({'x{}'.format(i):np.random.randn(100) for i in range(5)})

normal_vals.head()

In [None]:
normal_vals.where(normal_vals > 0).head()

`where` includes an optional `other` argument that accepts a scalar, a `Series` or `DataFrame`, or a callable, which supplies replacements for the values in the `DataFrame` that do not satisfy the condition.

For example, we can use this to return the absolute values of `normal_vals`:

In [None]:
normal_vals.where(normal_vals > 0, other=-normal_vals).head()
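
To check this behavior on something small enough to read at a glance, the sketch below uses a hypothetical three-row frame (`toy_vals` and its values are invented) and confirms that the shape is preserved and that the `other` replacement reproduces `abs()`:

```python
import pandas as pd

# Small made-up frame with a mix of positive and negative values
toy_vals = pd.DataFrame({'x0': [1.5, -2.0, 0.5], 'x1': [-0.1, 3.0, -4.0]})

# Entries failing the condition become NaN; shape and labels are unchanged
kept = toy_vals.where(toy_vals > 0)

# Entries failing the condition are taken from `other` at the same position
flipped = toy_vals.where(toy_vals > 0, other=-toy_vals)

print(kept)
print(flipped)
```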

Similarly, a callable can be used when we need to modify the replaced values; it is called on the `DataFrame` and its result supplies the replacements:

In [None]:
normal_vals.where(normal_vals>0, other=lambda y: -y*100).head()

Conversely, `mask` is the inverse of `where`: it replaces the values for which the condition is `True` and keeps the rest:

In [None]:
normal_vals.mask(normal_vals>0).head()
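
Put differently, masking on a condition should give the same result as calling `where` on its negation. A minimal sketch of that equivalence, again on a hypothetical toy frame with invented values:

```python
import pandas as pd

# Small made-up frame with a mix of positive and negative values
toy_vals = pd.DataFrame({'x0': [1.5, -2.0, 0.5], 'x1': [-0.1, 3.0, -4.0]})
cond = toy_vals > 0

# mask blanks out the entries where the condition holds
masked = toy_vals.mask(cond)

# where on the negated condition keeps exactly the same entries
same_via_where = toy_vals.where(~cond)

print(masked)
```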

## Selection with `query`

At times, selection using indexing can be verbose because it requires repeated use of the `DataFrame` namespace.

In [None]:
normal_vals[(normal_vals.x1 > normal_vals.x0) & (normal_vals.x3 > normal_vals.x2)].head()

For a more concise (and readable) syntax, we can use the `query` method to perform selection on a `DataFrame`. Instead of having to type the fully qualified column name each time, we can simply pass a string that describes what to select. The query above is then simply:

In [None]:
normal_vals.query('(x1 > x0) & (x3 > x2)').head()
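
The two forms are meant to select the same rows. The sketch below checks that on a hypothetical three-row frame whose values were chosen so that only the first row satisfies both conditions:

```python
import pandas as pd

# Made-up values; only row 0 has x1 > x0 and x3 > x2
toy_vals = pd.DataFrame({
    'x0': [0.1, 0.9, 0.3],
    'x1': [0.5, 0.2, 0.8],
    'x2': [0.4, 0.1, 0.9],
    'x3': [0.7, 0.6, 0.2],
})

# Boolean indexing repeats the DataFrame name for every column
by_mask = toy_vals[(toy_vals.x1 > toy_vals.x0) & (toy_vals.x3 > toy_vals.x2)]

# query resolves bare column names against the DataFrame itself
by_query = toy_vals.query('(x1 > x0) & (x3 > x2)')

print(by_query)
```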

The `DataFrame.index` and `DataFrame.columns` are placed in the query namespace by default. If you want to refer to a variable in the current namespace, you can prefix the variable with `@`:

In [None]:
min_loa = 700

In [None]:
vessels.query('max_loa > @min_loa')
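
Because the index is also available inside the query string, conditions on rows and on local variables can be mixed in one expression. A minimal sketch, assuming a toy table with invented identifiers and lengths; `first_id` is a hypothetical variable introduced only for this example:

```python
import pandas as pd

# Toy stand-in for the vessel table; values are made up for illustration
toy_vessels = pd.DataFrame(
    {'max_loa': [150.0, 720.0, 910.0, 95.0]},
    index=[101, 102, 103, 104],
)

min_loa = 700
first_id = 103  # hypothetical cutoff on the index labels

# @name looks up a Python variable from the surrounding scope
long_vessels = toy_vessels.query('max_loa > @min_loa')

# `index` refers to the row labels of the DataFrame
long_and_late = toy_vessels.query('max_loa > @min_loa and index >= @first_id')

print(long_vessels)
print(long_and_late)
```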