# `pandas` - Indexing & Selecting data

## Contents
1. Setup
1. Select columns 
1. Select rows

## Reference
- http://pandas.pydata.org/pandas-docs/stable/indexing.html
- http://pandas.pydata.org/pandas-docs/stable/index.html
- http://pandas.pydata.org/pandas-docs/stable/dsintro.html
- http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
- http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

## 1. Setup

Load libraries.

In [6]:
import pandas  as pd
import numpy   as np
print('pandas',pd.__version__)
print('numpy ',np.__version__)

The most common way to create a DataFrame is to use the `read_csv` (pandas) function to read a CSV file. 
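A minimal sketch of `read_csv`, assuming a small hypothetical CSV held in memory (the name `csv_text` and its contents are made up for illustration; a real call would usually pass a file path instead):

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would normally live in a file.
csv_text = "col_a,col_b\n100,200\n101,201\n"

# read_csv accepts a path or a file-like object; StringIO stands in for a file here.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

The first line of the CSV is used for the column names, and the rows receive a default integer index.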

Another common technique is to use the `DataFrame` constructor. Its three most commonly used parameters are:
1. `data`, which can be a numpy array, a dictionary, a list of lists or another DataFrame
1. `index`, which is a list of the names of the rows
1. `columns`, which is a list of the names of the columns
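The three parameters above can be sketched as follows. The variable names and values are illustrative only, not taken from the rest of this notebook:

```python
import numpy as np
import pandas as pd

# From a NumPy array: row and column labels are supplied explicitly.
df_from_array = pd.DataFrame(data=np.array([[100, 200], [101, 201]]),
                             index=['row_1', 'row_2'],
                             columns=['col_a', 'col_b'])

# From a dictionary: keys become column names, values become the column data.
df_from_dict = pd.DataFrame(data={'col_a': [100, 101], 'col_b': [200, 201]},
                            index=['row_1', 'row_2'])

# From another DataFrame: `columns` selects which existing columns to keep.
df_from_df = pd.DataFrame(data=df_from_dict, columns=['col_b'])

print(df_from_array)
print(df_from_dict)
print(df_from_df)
```

When `index` or `columns` is omitted, pandas falls back to default integer labels starting at 0.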

Tables are often stored in CSV files, so the `read_csv` function is commonly used to read them into Python. The `DataFrame` constructor is used in many situations in which more control is needed when creating dataframes. Just a few of these situations are listed below:
- Very small dataframes are needed to demonstrate functions or methods (for instance, this notebook)
- Very small dataframes are needed to demonstrate problems encountered when asking for help. This is not only a good idea, but is also expected when asking questions on Stack Overflow <https://stackoverflow.com>.
- Dataframes can be constructed using dictionaries, Series, arrays, list-like objects and iterables

In addition, see the documentation below (also listed in the Reference section above), which lists the attributes and methods of the pandas dataframe:
- http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

Create a sample dataframe for the demonstrations below.

In [9]:
df_col = pd.DataFrame(data=[[100, 200, 300, 400],
                            [101, 201, 301, 401],
                            [102, 202, 302, 402]], 
                      columns=['col_a', 'col_b', 'col_c', 'col_d']
                     )
df_col

## 2. Select Columns

A column can be accessed as an attribute of the dataframe.

In [12]:
x = df_col.col_b
print(type(x))
x

Notice that a series is returned.

A column can be accessed using (single) square brackets.

In [15]:
x = df_col['col_b']
print(type(x))
x

Notice that a series is returned in both cases above.
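The two access styles above are not fully interchangeable. The sketch below, which uses a hypothetical dataframe `df_names`, suggests why square brackets are the safer choice: attribute access cannot express a column name containing a space, and a column whose name matches an existing DataFrame method (here `count`) is shadowed by that method.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another matches a method name.
df_names = pd.DataFrame({'col b': [1, 2], 'count': [10, 20]})

# Square brackets work for any column name.
col_with_space = df_names['col b']
count_column = df_names['count']

# Attribute access finds the DataFrame.count method before the column.
print(callable(df_names.count))
print(type(count_column))
```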

A dataframe containing a single column can be returned by using double square brackets, that is, by passing a one-element list of column names.

In [18]:
x = df_col[['col_b']]
print(type(x))
x

A dataframe with multiple columns can be returned by passing a list of columns to the square brackets.

In [20]:
x = df_col[['col_b','col_c']]
print(type(x))
x

In [21]:
col_list = ['col_b','col_c']
x = df_col[col_list]
print(type(x))
x

Columns can also be selected using the `iloc` indexer, which specifies columns by their integer location.

In [23]:
x = df_col.iloc[:,0]
print(type(x))
x

Notice that a Series is returned.

Multiple columns can be retrieved with a list of integer column locations.

In [26]:
x = df_col.iloc[:,[0,2]]
print(type(x))
x

Notice that a `DataFrame` is returned.
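Besides a single integer or a list of integers, `iloc` also accepts slices and negative positions. A short sketch on a stand-in dataframe (`df_demo` is defined here only so that the block runs on its own):

```python
import pandas as pd

df_demo = pd.DataFrame([[100, 200, 300, 400],
                        [101, 201, 301, 401]],
                       columns=['col_a', 'col_b', 'col_c', 'col_d'])

# A slice of positions: start included, stop excluded (positions 0 and 1).
first_two = df_demo.iloc[:, 0:2]

# A negative position counts from the end; a single integer returns a Series.
last_col = df_demo.iloc[:, -1]

# Wrapping the position in a list returns a one-column DataFrame instead.
last_col_frame = df_demo.iloc[:, [-1]]

print(type(first_two), type(last_col), type(last_col_frame))
```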

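Columns can be accessed by their names using the `loc` indexer. The row slice `:3` in the cell below selects every row, because `loc` slices by label and all of the row labels (0, 1 and 2) fall at or before 3.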

In [29]:
x = df_col.loc[:3,'col_b']
print(type(x))
x

Notice that a Series is returned.

Multiple columns can be retrieved with a list of column labels.

In the first code cell below, a dataframe with a single column is returned.

In [32]:
x = df_col.loc[:,['col_b']]
print(type(x))
x

In the following code cell, a dataframe with two columns is returned.

In [34]:
x = df_col.loc[:,['col_b','col_a']]
print(type(x))
x

The `loc` indexer can also be used to retrieve a range of columns between, and including, the columns specified.

In [36]:
x = df_col.loc[:,'col_a':'col_b']
print(type(x))
x
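The difference between label slices and position slices is easy to overlook, so here is a small comparison on a stand-in dataframe (`df_demo` is illustrative). A `loc` slice includes its stop label, while an `iloc` slice excludes its stop position:

```python
import pandas as pd

df_demo = pd.DataFrame([[100, 200, 300, 400],
                        [101, 201, 301, 401]],
                       columns=['col_a', 'col_b', 'col_c', 'col_d'])

# Label slice: 'col_c' is included in the result.
by_label = df_demo.loc[:, 'col_a':'col_c']

# Position slice: position 3 is excluded, so positions 0, 1 and 2 are returned.
by_position = df_demo.iloc[:, 0:3]

# A slice returns a DataFrame even when it spans a single column.
one_col_frame = df_demo.loc[:, 'col_b':'col_b']

print(by_label.equals(by_position))
print(type(one_col_frame))
```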

## 3. Select Rows

This section contains three sub-sections which describe three types of indexes: numeric, character and datetime.

### 3.1 Numeric index

Create a dataframe with a numeric index whose labels differ from the positions of the rows in the dataframe.

In [41]:
df_num_index = pd.DataFrame([[100, 200, 300, 400],
                             [101, 201, 301, 401],
                             [102, 202, 302, 402]], 
                            columns=['col_a', 'col_b', 'col_c', 'col_d'],
                            index  =[100, 200, 300]
                           )
df_num_index

Recall that the `iloc` indexer retrieves rows of a dataframe based on the __numeric position__ of the rows in the dataframe.

The colon operator `:` retrieves a range of rows (from the first element and up to, but not including, the second element).

In [43]:
df_num_index.iloc[0:2,:]

The `loc` indexer retrieves rows of a dataframe based on the __index labels__ of the dataframe.

In [45]:
df_num_index.loc[100:300,:]

Notice that the endpoints are included in the specified range.
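With a numeric index, labels and positions are easy to confuse. The sketch below uses a stand-in dataframe (`df_demo`, with the same index labels as above) to show that `loc[100]` and `iloc[0]` address the same row, while `loc[0]` fails because no row carries the label 0:

```python
import pandas as pd

df_demo = pd.DataFrame([[100, 200], [101, 201], [102, 202]],
                       columns=['col_a', 'col_b'],
                       index=[100, 200, 300])

by_label = df_demo.loc[100]      # the row whose label is 100
by_position = df_demo.iloc[0]    # the row at position 0

# There is no label 0 in the index, so loc raises a KeyError.
try:
    df_demo.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label.equals(by_position), label_zero_found)
print(len(df_demo.loc[100:200]), len(df_demo.iloc[0:1]))
```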

### 3.2 Character index

Create a dataframe with a character index.

In [49]:
df_row_col = pd.DataFrame([[100, 200, 300, 400],
                           [101, 201, 301, 401],
                           [102, 202, 302, 402]], 
                          columns=['col_a', 'col_b', 'col_c', 'col_d'],
                          index  =['row_1', 'row_2', 'row_3']
                         )
df_row_col

Retrieve rows of a dataframe based on the character index of the rows in that dataframe.

In [51]:
df_row_col.loc['row_2',:]

Notice that when a single row is returned, the result is a Series. If `df_row_col` contained a column of dtype `object` (for example, strings), the resulting Series would also have dtype `object`. (Try it.)
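The dtype behaviour described above can be checked with a small sketch. The dataframes `df_numeric` and `df_mixed` and the column `col_s` are hypothetical and not part of the examples in this notebook:

```python
import pandas as pd

df_numeric = pd.DataFrame({'col_a': [100, 101], 'col_b': [200, 201]},
                          index=['row_1', 'row_2'])
df_mixed = pd.DataFrame({'col_a': [100, 101], 'col_s': ['x', 'y']},
                        index=['row_1', 'row_2'])

# All columns are integers, so the row keeps an integer dtype.
row_numeric = df_numeric.loc['row_1', :]

# Integers and strings share no common numeric type, so the row falls back to object.
row_mixed = df_mixed.loc['row_1', :]

print(row_numeric.dtype)
print(row_mixed.dtype)
```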

Retrieve two rows by specifying their names in a list.

In [54]:
df_row_col.loc[['row_2','row_3'],:]

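Notice that a `DataFrame` is returned. This is also the case for the row slice in the next example, but not for the final example of this sub-section: selecting a single row by its label returns a Series, even when several columns are requested.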

Retrieve all rows between two endpoints.

In [57]:
df_row_col.loc['row_1':'row_3',:]

Retrieve a single row and all columns between two endpoints.

In [59]:
df_row_col.loc['row_2','col_b':'col_d']
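Whether `loc` returns a scalar, a Series or a DataFrame depends on how each axis is specified. A sketch on a stand-in dataframe (`df_demo` is illustrative and mirrors the character index used above):

```python
import pandas as pd

df_demo = pd.DataFrame([[100, 200, 300, 400],
                        [101, 201, 301, 401],
                        [102, 202, 302, 402]],
                       columns=['col_a', 'col_b', 'col_c', 'col_d'],
                       index=['row_1', 'row_2', 'row_3'])

# Single row label + single column label: a scalar value.
cell = df_demo.loc['row_2', 'col_b']

# Single row label + column slice: a Series.
row_as_series = df_demo.loc['row_2', 'col_b':'col_d']

# One-element list of row labels + column slice: a one-row DataFrame.
row_as_frame = df_demo.loc[['row_2'], 'col_b':'col_d']

print(cell, type(row_as_series), type(row_as_frame))
```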

### 3.3 Datetime index

Create a dataframe with a datetime index.

In [62]:
df_dt_col = pd.DataFrame(data=[[100, 200, 300, 400],
                               [101, 201, 301, 401],
                               [102, 202, 302, 402]], 
                         columns=['col_a', 'col_b', 'col_c', 'col_d'],
                         index  =pd.date_range(pd.to_datetime('20180203', 
                                                              format='%Y%m%d'),
                                               periods=3,
                                               freq='D')
                        )
df_dt_col

Notice that the argument to the `format` parameter of `pd.to_datetime` specifies the format to use when parsing the first argument (the date string).
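To illustrate the role of `format`, the sketch below parses the same date from two differently formatted strings. The second string and its day-first format are assumptions added for illustration:

```python
import pandas as pd

# Year, month and day run together, as in the cell above.
start = pd.to_datetime('20180203', format='%Y%m%d')

# The same date written day-first with slashes (illustrative).
same_day = pd.to_datetime('03/02/2018', format='%d/%m/%Y')

# date_range builds three consecutive daily timestamps starting at `start`.
dates = pd.date_range(start, periods=3, freq='D')

print(start == same_day)
print(dates)
```

Without an explicit `format`, a string such as `'03/02/2018'` is ambiguous between 3 February and 2 March, which is one reason to state the format explicitly.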

The `iloc` indexer can be used to return rows and columns based on their position.

In [65]:
df_dt_col.iloc[0:2,1:3]

In [66]:
x = df_dt_col.loc['2018-02-03',['col_b','col_d']]
print(type(x))
x

In [67]:
x = df_dt_col.loc['2018-02-04':'2018-02-05',['col_b','col_d']]
print(type(x))
x

In [68]:
x = df_dt_col.loc['2018',['col_b','col_d']]
print(type(x))
x

__Exercise__: try changing the previous cell so that `loc` accepts a list of row names.

In [70]:
x = df_col.iloc[0,:]
print(type(x))
x

__The End__