# `pandas` - Indexing & Selecting data

__Contents__:
1. Setup
1. Select Columns
1. Select Rows

## Reference
- http://pandas.pydata.org/pandas-docs/stable/indexing.html
- http://pandas.pydata.org/pandas-docs/stable/index.html
- http://pandas.pydata.org/pandas-docs/stable/dsintro.html

## 1. Setup

Load libraries.

In [6]:
import pandas  as pd
import numpy   as np
(pd.__version__,
 np.__version__
)

The most common way to create a DataFrame is to read a CSV file with the pandas `read_csv` function.
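
As a rough illustration of `read_csv`, the sketch below reads a small CSV from an in-memory string. The CSV content and the name `df_csv` are made up for this example; `io.StringIO` only stands in for a file on disk, where a path would normally be passed instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file path.
csv_text = "col_a,col_b,col_c\n100,200,300\n101,201,301\n"

# The first line of the CSV is used as the column names by default.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```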

Another common technique is to call the `DataFrame` constructor directly. Its three most commonly used parameters are:
1. `data`, which can be a NumPy array, a dictionary, a list of lists (as used below), or another DataFrame
1. `index`, which is a list of the row labels
1. `columns`, which is a list of the column labels

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
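
The three kinds of `data` listed above might be passed as in the following sketch. The variable names and values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# From a NumPy array: row and column labels are supplied separately.
arr = np.array([[100, 200], [101, 201]])
df_from_array = pd.DataFrame(data=arr,
                             index=['row_1', 'row_2'],
                             columns=['col_a', 'col_b'])

# From a dictionary: the keys become the column names.
df_from_dict = pd.DataFrame(data={'col_a': [100, 101], 'col_b': [200, 201]},
                            index=['row_1', 'row_2'])

# From another DataFrame: `columns` picks which existing columns to keep.
df_from_df = pd.DataFrame(data=df_from_array, columns=['col_b'])

print(df_from_array)
print(df_from_dict)
print(df_from_df)
```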

Create a sample dataframe for the demonstrations below.

In [9]:
df_col = pd.DataFrame(data=[[100, 200, 300, 400],
                            [101, 201, 301, 401],
                            [102, 202, 302, 402]], 
                      columns=['col_a', 'col_b', 'col_c', 'col_d']
                     )
df_col

## 2. Select Columns

A column can be accessed as an attribute of the dataframe.

In [12]:
x = df_col.col_b
print(type(x))
x

Notice that a series is returned.
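
Attribute access is convenient, but it appears to work only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The sketch below uses a made-up dataframe (`df_demo`) with two awkward column names to show where square brackets are needed.

```python
import pandas as pd

# Hypothetical dataframe: 'count' clashes with a DataFrame method,
# and 'col b' contains a space, so neither is safe as an attribute.
df_demo = pd.DataFrame({'count': [1, 2], 'col b': [3, 4]})

attr = df_demo.count            # the DataFrame.count method, not the column
col = df_demo['count']          # the column, as a Series
col_with_space = df_demo['col b']

print(callable(attr))
print(type(col))
```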

A column can be accessed using (single) square brackets.

In [15]:
x = df_col['col_b']
print(type(x))
x

Notice that a series is returned.

In [17]:
col_name = 'col_b'
x = df_col[col_name]
print(type(x))
x

Notice that a series is returned.

To get a dataframe containing a single column, rather than a series, use double square brackets (that is, pass a one-element list of column names).

In [20]:
x = df_col[['col_b']]
print(type(x))
x
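
One way to see the difference between single and double brackets is to compare the shapes of the two results. This is a small sketch on a made-up dataframe (`df_demo`), not part of the original demonstration data.

```python
import pandas as pd

df_demo = pd.DataFrame([[100, 200], [101, 201]], columns=['col_a', 'col_b'])

as_series = df_demo['col_b']    # single brackets: one-dimensional Series
as_frame = df_demo[['col_b']]   # double brackets: two-dimensional DataFrame

print(as_series.shape)
print(as_frame.shape)
```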

A dataframe with multiple columns can be returned by passing a list of columns to the square brackets.

In [22]:
x = df_col[['col_b','col_c']]
print(type(x))
x

In [23]:
col_list = ['col_b','col_c']
x = df_col[col_list]
print(type(x))
x

Columns can also be selected using the `iloc` indexer.

`iloc` specifies rows and columns by their integer position, starting at 0. A `:` in the row position selects all rows.

In [25]:
x = df_col.iloc[:,0]
print(type(x))
x

In [26]:
x = df_col.iloc[:,[0,2]]
print(type(x))
x
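
`iloc` also accepts slices and negative positions, following the usual Python conventions. The sketch below assumes a small made-up dataframe (`df_demo`) with the same column names as above.

```python
import pandas as pd

df_demo = pd.DataFrame([[100, 200, 300, 400],
                        [101, 201, 301, 401]],
                       columns=['col_a', 'col_b', 'col_c', 'col_d'])

# Positions 1 and 2 only: the stop position of an iloc slice is excluded.
middle = df_demo.iloc[:, 1:3]

# A negative position counts from the end; -1 is the last column.
last = df_demo.iloc[:, -1]

print(list(middle.columns))
print(last.name)
```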

Columns can be selected by name using the `loc` indexer.

`loc` specifies rows and columns by their labels rather than by their positions.

In [28]:
x = df_col.loc[:,'col_b']
print(type(x))
x

In [29]:
x = df_col.loc[:,['col_b']]
print(type(x))
x

In [30]:
x = df_col.loc[:,['col_b','col_a']]
print(type(x))
x

In [31]:
x = df_col.loc[:,'col_a':'col_b']
print(type(x))
x
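
Unlike positional slices, a label slice passed to `loc` includes both endpoints. The following sketch, on a made-up dataframe (`df_demo`), compares a label slice with the positional slice that selects the same columns.

```python
import pandas as pd

df_demo = pd.DataFrame([[100, 200, 300, 400],
                        [101, 201, 301, 401]],
                       columns=['col_a', 'col_b', 'col_c', 'col_d'])

# Label slice: 'col_c' (the stop label) is included.
by_label = df_demo.loc[:, 'col_a':'col_c']

# Positional slice: position 3 (the stop position) is excluded.
by_position = df_demo.iloc[:, 0:3]

print(list(by_label.columns))
print(by_label.equals(by_position))
```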

## 3. Select Rows

Create a dataframe whose numeric index labels differ from the positions of the rows in the dataframe.

In [34]:
df_num_index = pd.DataFrame([[100, 200, 300, 400],
                             [101, 201, 301, 401],
                             [102, 202, 302, 402]], 
                            columns=['col_a', 'col_b', 'col_c', 'col_d'],
                            index  =[100, 200, 300]
                           )
df_num_index

Retrieve rows of a dataframe based on the numeric position of the rows in that dataframe.

In [36]:
df_num_index.iloc[0:2,:]

Retrieve rows of a dataframe based on the numeric index of the rows in that dataframe.

In [38]:
df_num_index.loc[[100,300],:]
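
With a numeric index, it is easy to confuse labels and positions. The sketch below, on a made-up dataframe (`df_demo`), shows that `iloc[0]` and `loc[100]` refer to the same row, while `loc[0]` fails because no row carries the label 0.

```python
import pandas as pd

df_demo = pd.DataFrame([[100, 200], [101, 201], [102, 202]],
                       columns=['col_a', 'col_b'],
                       index=[100, 200, 300])

first_by_position = df_demo.iloc[0]   # row at position 0
first_by_label = df_demo.loc[100]     # row labelled 100 (the same row)

# loc looks up labels, so 0 is not found in the index [100, 200, 300].
try:
    df_demo.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position.equals(first_by_label))
print(label_zero_found)
```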

Create a dataframe with a character index.

In [40]:
df_row_col = pd.DataFrame([[100, 200, 300, 400],
                           [101, 201, 301, 401],
                           [102, 202, 302, 402]], 
                          columns=['col_a', 'col_b', 'col_c', 'col_d'],
                          index  =['row_1', 'row_2', 'row_3']
                         )
df_row_col

Retrieve rows of a dataframe based on the character index of the rows in that dataframe.

In [42]:
df_row_col.loc['row_2',:]

In [43]:
df_row_col.loc[['row_2','row_3'],:]

In [44]:
df_row_col.loc['row_1':'row_3',:]

In [45]:
df_row_col.loc['row_2','col_b':'col_d']
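
The type returned by `loc` depends on whether each selector is a single label, a list, or a slice. A minimal sketch on a made-up dataframe (`df_demo`):

```python
import pandas as pd

df_demo = pd.DataFrame([[100, 200], [101, 201]],
                       columns=['col_a', 'col_b'],
                       index=['row_1', 'row_2'])

cell = df_demo.loc['row_2', 'col_b']   # two single labels: a scalar value
row = df_demo.loc['row_2', :]          # one row label: a Series
frame = df_demo.loc[['row_2'], :]      # a list of row labels: a DataFrame

print(cell)
print(type(row))
print(type(frame))
```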

Create a dataframe with a datetime index.

In [47]:
df_dt_col = pd.DataFrame(data=[[100, 200, 300, 400],
                               [101, 201, 301, 401],
                               [102, 202, 302, 402]], 
                         columns=['col_a', 'col_b', 'col_c', 'col_d'],
                         index  =pd.date_range(pd.to_datetime('20180203', 
                                                              format='%Y%m%d'),
                                               periods=3,
                                               freq='D')
                        )
df_dt_col

In [48]:
df_dt_col.iloc[0:2,1:3]

In [49]:
x = df_dt_col.loc['2018-02-03',['col_b','col_d']]
print(type(x))
x

In [50]:
x = df_dt_col.loc['2018-02-04':'2018-02-05',['col_b','col_d']]
print(type(x))
x
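
String labels passed to `loc` on a datetime index are parsed as dates, and a less specific string, such as a year and month, selects every row that falls within that period. The sketch below uses a made-up dataframe (`df_demo`) whose dates span the end of February and the start of March 2018.

```python
import pandas as pd

# Four consecutive days: 2018-02-27, 2018-02-28, 2018-03-01, 2018-03-02.
df_demo = pd.DataFrame({'col_a': [100, 101, 102, 103]},
                       index=pd.date_range('2018-02-27', periods=4, freq='D'))

feb = df_demo.loc['2018-02']   # all rows dated in February 2018
mar = df_demo.loc['2018-03']   # all rows dated in March 2018

print(feb)
print(mar)
```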

In [51]:
x = df_dt_col.loc['2018',['col_b','col_d']]
print(type(x))
x

__Exercise__: try changing the previous cell so that `loc` accepts a list of row names.

A single row selected by position with `iloc` is returned as a series.

In [53]:
x = df_col.iloc[0,:]
print(type(x))
x

__The End__