# Accessing Elements in DataFrames

This lesson covers:

* Accessing specific elements in pandas Series and DataFrames

Accessing elements in an array or a DataFrame is a common task. To begin this
lesson, clear the workspace and set up some vectors and a $5\times5$ array. These
vectors and the array make it easy to see which elements a command selects.

Start by creating two DataFrames and two Series. Define `x=np.arange(25).reshape(5,5)`,
which is a 5 by 5 array, and `y=np.arange(5)`, which is a 5-element 1-d array.
We need:

* `x_df`: A default `DataFrame` containing `x`
* `x_named`: A `DataFrame` containing `x` with index `"r0"`, `"r1"`, ..., `"r4"` and
  columns `"c0"`, `"c1"`, ..., `"c4"`.
* `y_s`: A default `Series` containing `y`
* `y_named`: A `Series` containing `y` that has the index `"r0"`, `"r1"`, ..., `"r4"`

In [1]:
import numpy as np
import pandas as pd

x = np.arange(25).reshape((5, 5))
y = np.arange(5)
# The -1 tells numpy to automatically compute the size of
# the dimension using the remaining elements, in this case, 5
z = np.arange(5).reshape((-1, 1))

x_df = pd.DataFrame(x)
x_named = pd.DataFrame(
    x, index=["r0", "r1", "r2", "r3", "r4"], columns=["c0", "c1", "c2", "c3", "c4"]
)
y_s = pd.Series(y)
y_named = pd.Series(y, index=x_named.index)

print(f"x = {x}")
print(f"y = {y}")
print(f"z = {z}")

print()
print(f"x_df = \n{x_df}")
print(f"y_s = \n{y_s}")

print()
print(f"x_named = \n{x_named}")
print(f"y_named = \n{y_named}")

x = [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
y = [0 1 2 3 4]
z = [[0]
 [1]
 [2]
 [3]
 [4]]

x_df = 
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
y_s = 
0    0
1    1
2    2
3    3
4    4
dtype: int64

x_named = 
    c0  c1  c2  c3  c4
r0   0   1   2   3   4
r1   5   6   7   8   9
r2  10  11  12  13  14
r3  15  16  17  18  19
r4  20  21  22  23  24
y_named = 
r0    0
r1    1
r2    2
r3    3
r4    4
dtype: int64


## Problem: Selecting a row by name

Select the 2nd row of `x_named` using `.loc`.


In [2]:
x_named.loc["r1"]

c0    5
c1    6
c2    7
c3    8
c4    9
Name: r1, dtype: int64
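
The result above is a `Series` because a single label was passed to `.loc`. The sketch below rebuilds `x_named` as in the setup cell and contrasts a scalar label with a one-element list of labels. The names `row_as_series` and `row_as_frame` are illustrative and not part of the lesson.

```python
import numpy as np
import pandas as pd

# Rebuild the named DataFrame from the setup cell
row_labels = [f"r{i}" for i in range(5)]
col_labels = [f"c{i}" for i in range(5)]
x_named = pd.DataFrame(
    np.arange(25).reshape((5, 5)), index=row_labels, columns=col_labels
)

row_as_series = x_named.loc["r1"]   # scalar label -> Series
row_as_frame = x_named.loc[["r1"]]  # list with one label -> 1-row DataFrame

print(type(row_as_series).__name__, row_as_series.shape)
print(type(row_as_frame).__name__, row_as_frame.shape)
```

Passing a list keeps the row and column labels in a two-dimensional result, which can be convenient when later code expects a `DataFrame`.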

## Problem: Selecting a column by name

Select the 2nd column of `x_named` using both `[]` and `.loc`.

In [3]:
x_named.loc[:, "c1"]

r0     1
r1     6
r2    11
r3    16
r4    21
Name: c1, dtype: int64

In [4]:
x_named["c1"]

r0     1
r1     6
r2    11
r3    16
r4    21
Name: c1, dtype: int64

In [5]:
x_named.c1

r0     1
r1     6
r2    11
r3    16
r4    21
Name: c1, dtype: int64
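
Attribute access such as `x_named.c1` is a shortcut that only works for some column names. The following sketch uses a small hypothetical frame (the column names `"my col"` and `"count"` are made up for illustration) to show two cases where `[]` is needed instead.

```python
import pandas as pd

# Hypothetical frame: one name is a valid identifier, one contains a space,
# and one collides with an existing DataFrame method.
df = pd.DataFrame({"c1": [1, 2], "my col": [3, 4], "count": [5, 6]})

print(df.c1.tolist())         # attribute access works for "c1"
print(df["my col"].tolist())  # a name with a space needs []
print(callable(df.count))     # df.count is the method, not the column
print(df["count"].tolist())   # [] returns the column
```

Because `[]` and `.loc` work for any column label, they are the safer choice in code that must handle arbitrary column names.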

## Problem: Selecting an element of a Series by name

Select the 2nd element of `y_named` using both `[]` and `.loc`.


In [6]:
print(y_named["r1"])

1


In [7]:
y_named.loc["r1"]

np.int64(1)

In [8]:
y_named.r1

np.int64(1)

## Problem: Selecting rows and columns by name

Select the 2nd and 4th rows and 1st and 3rd columns of `x_named`.

In [9]:
x_named.loc[["r1", "r3"], ["c0", "c2"]]

    c0  c2
r1   5   7
r3  15  17


In [10]:
# Also right, but not recommended
x_named[["c0", "c2"]].loc[["r1", "r3"]]

    c0  c2
r1   5   7
r3  15  17


In [11]:
# Also right, but not recommended
x_named.loc[["r1", "r3"]][["c0", "c2"]]

    c0  c2
r1   5   7
r3  15  17
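
One likely reason chained selection is discouraged is assignment: the first `[]` returns a new object, so a value assigned through the chain lands on a temporary rather than on the original. The sketch below, which assumes only the `x_named` frame from the setup cell, compares chained assignment with a single `.loc` call. The names `chained` and `single` are illustrative.

```python
import warnings

import numpy as np
import pandas as pd

row_labels = [f"r{i}" for i in range(5)]
col_labels = [f"c{i}" for i in range(5)]
x_named = pd.DataFrame(
    np.arange(25).reshape((5, 5)), index=row_labels, columns=col_labels
)

chained = x_named.copy()
with warnings.catch_warnings():
    # pandas may warn about chained assignment; silence it for this demo
    warnings.simplefilter("ignore")
    # The first [] builds a new DataFrame, so this assignment hits a temporary
    chained[["c0", "c2"]].loc[["r1", "r3"]] = -1

single = x_named.copy()
# One .loc call selects and assigns on the original object
single.loc[["r1", "r3"], ["c0", "c2"]] = -1

print(chained.equals(x_named))  # chained assignment left the data unchanged
print(single.loc[["r1", "r3"], ["c0", "c2"]])
```

Selecting rows and columns in one `.loc[rows, columns]` call avoids this ambiguity for both reading and writing.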


## Problem: DataFrame selection with default index and column names

Select the 2nd and 4th rows and 1st and 3rd columns of `x_df`.


In [12]:
x_df.loc[[1, 3], [0, 2]]

    0   2
1   5   7
3  15  17


## Problem: Series selection with the default index

Select the final element in `y_s`.

In [13]:
y_s[4]

np.int64(4)

In [14]:
y_s.loc[4]

np.int64(4)
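
Both selections above use the label `4`, which happens to coincide with the last position because `y_s` has the default index. As a hedged aside, the sketch below shows `.iloc`, which selects by position, and what happens when a negative integer is used with `[]`. The names `by_label`, `by_position`, `negative_label_found` and `shorter` are illustrative.

```python
import numpy as np
import pandas as pd

y_s = pd.Series(np.arange(5))

by_label = y_s.loc[4]       # label 4 happens to be the last label
by_position = y_s.iloc[-1]  # last position, whatever the labels are

# With an integer index, [] looks up labels, so -1 is not found
try:
    y_s[-1]
    negative_label_found = True
except KeyError:
    negative_label_found = False

# After dropping a row, the label 4 no longer exists but a last position does
shorter = y_s.drop(4)
print(by_label, by_position, negative_label_found, shorter.iloc[-1])
```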

## Problem: Subseries selection

Select the subseries of `y_named` and `y_s` containing the first, fourth and fifth elements.

In [15]:
y_named[["r0", "r3", "r4"]]

r0    0
r3    3
r4    4
dtype: int64

In [16]:
y_s.loc[[0, 3, 4]]

0    0
3    3
4    4
dtype: int64

Load the data in momentum.csv.

In [17]:
# Setup: Load the momentum data

import pandas as pd

momentum = pd.read_csv("data/momentum.csv", index_col="date", parse_dates=True)
momentum.head()

            mom_01  mom_02  mom_03  mom_04  mom_05  mom_06  mom_07  mom_08  mom_09  mom_10
date
2016-01-04    0.67   -0.03   -0.93   -1.11   -1.47   -1.66   -1.40   -2.08   -1.71   -2.67
2016-01-05   -0.36    0.20   -0.37    0.28    0.16    0.18   -0.22    0.25    0.29    0.13
2016-01-06   -4.97   -2.33   -2.60   -1.16   -1.70   -1.45   -1.15   -1.46   -1.14   -0.45
2016-01-07   -4.91   -1.91   -3.03   -1.87   -2.31   -2.30   -2.70   -2.31   -2.36   -2.66
2016-01-08   -0.40   -1.26   -0.98   -1.26   -1.13   -1.02   -0.96   -1.42   -0.94   -1.32


## Problem: Selecting data on a single day

Select returns on February 16, 2016.


In [18]:
momentum.loc["2016-2-16"]

mom_01    4.94
mom_02    2.46
mom_03    2.59
mom_04    2.17
mom_05    2.24
mom_06    1.83
mom_07    1.57
mom_08    1.56
mom_09    1.35
mom_10    1.72
Name: 2016-02-16 00:00:00, dtype: float64
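
Selecting one date returns a `Series` whose name is the matching timestamp, as the `Name:` line above shows. Since `momentum.csv` may not be at hand, the sketch below uses a synthetic stand-in with made-up values on 2016 business days (the frame `fake` and its contents are assumptions, not the lesson's data). It also shows that a date missing from the index raises `KeyError`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for momentum.csv: business days in 2016, made-up values
dates = pd.bdate_range("2016-01-01", "2016-12-31", name="date")
fake = pd.DataFrame(
    np.arange(len(dates) * 2, dtype=float).reshape(-1, 2),
    index=dates,
    columns=["mom_01", "mom_10"],
)

one_day = fake.loc["2016-2-16"]  # a Tuesday, present in the index
print(type(one_day).__name__, one_day.name)

# 2016-02-14 is a Sunday, so it is absent from a business-day index
try:
    fake.loc["2016-2-14"]
    found = True
except KeyError:
    found = False
print(found)
```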

## Problem: Selecting data in a single month

Select returns in March 2016.

In [19]:
march = momentum.loc["2016-3"]
# Use head to show only the top 5 rows
march.head()

            mom_01  mom_02  mom_03  mom_04  mom_05  mom_06  mom_07  mom_08  mom_09  mom_10
date
2016-03-01    1.47    2.37    2.38    2.98    2.61    2.68    2.31    2.12    1.42    2.54
2016-03-02    5.76    3.26    1.53    0.40    0.58    0.56    0.29    0.21    0.39    0.14
2016-03-03    4.09    2.41    1.36    0.99    0.71    0.50    0.38    0.41    0.33   -0.29
2016-03-04    2.69    0.82    1.04    0.54    0.73    0.40    0.09    0.47    0.13    0.02
2016-03-07    3.04    2.27    1.46    0.67    0.63    0.87    0.37   -0.04   -0.17   -1.24


## Problem: Selecting data in a single year

Select returns in 2016.


In [20]:
mom_2016 = momentum.loc["2016"]
mom_2016.tail()

            mom_01  mom_02  mom_03  mom_04  mom_05  mom_06  mom_07  mom_08  mom_09  mom_10
date
2016-12-23    1.00    0.39    0.15    0.29   -0.02    0.12    0.08    0.24    0.11    0.63
2016-12-27    0.22    0.33    0.18    0.26    0.26    0.18    0.18    0.41    0.22    1.18
2016-12-28   -1.01   -0.96   -0.78   -0.80   -0.65   -0.82   -0.76   -1.01   -1.26   -1.80
2016-12-29   -0.16    0.05   -0.10    0.07    0.06   -0.21   -0.12   -0.11   -0.02    0.31
2016-12-30   -0.26   -0.54   -0.56   -0.53   -0.68   -0.30   -0.12   -0.56   -0.57   -1.49


## Problem: Selecting data in a date range

Select returns between May 1, 2016, and June 15, 2016.

In [21]:
block = momentum.loc["2016-5-1":"2016-6-15"]
block.tail()

            mom_01  mom_02  mom_03  mom_04  mom_05  mom_06  mom_07  mom_08  mom_09  mom_10
date
2016-06-09   -1.72   -1.38   -0.70   -0.74   -0.73   -0.55   -0.32   -0.16    0.37    0.33
2016-06-10   -3.72   -2.63   -1.91   -1.98   -1.59   -0.95   -0.80   -0.72   -0.39   -0.79
2016-06-13   -1.08    0.14   -1.24   -1.25   -0.97   -0.72   -0.90   -0.48   -0.91   -0.67
2016-06-14   -0.33   -0.70   -0.42   -1.34   -0.85   -0.15   -0.07   -0.02    0.13    0.28
2016-06-15    0.65    0.58   -0.14    0.27   -0.07   -0.06   -0.44   -0.25   -0.20   -0.26
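
Note that the last row above is June 15 itself: label slices with `.loc` include the end point, unlike positional slices. The sketch below illustrates this on a synthetic business-day frame with made-up values (the frame `fake` is an assumption standing in for `momentum`), and also shows that a slice end given at month resolution covers the whole month.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for momentum.csv: business days in 2016, made-up values
dates = pd.bdate_range("2016-01-01", "2016-12-31", name="date")
fake = pd.DataFrame({"mom_01": np.arange(len(dates), dtype=float)}, index=dates)

block = fake.loc["2016-5-1":"2016-6-15"]
# 2016-05-01 is a Sunday, so the block starts on the next available date;
# the end label 2016-06-15 is included in the result
print(block.index[0].date(), block.index[-1].date())

# An end point given as YYYY-MM extends to the last date in that month
through_june = fake.loc["2016-05":"2016-06"]
print(through_june.index[-1].date())
```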


## Exercises

### Exercise: Subset time-series

Select the data for May 2017 for momentum portfolios 1 and 10.

In [22]:
momentum.loc["2017-05", ["mom_01", "mom_10"]]

            mom_01  mom_10
date
2017-05-01   -0.25    1.36
2017-05-02    0.05   -0.82
2017-05-03   -0.75   -0.39
2017-05-04    0.10   -0.09
2017-05-05    0.48    0.95
2017-05-08   -0.52    0.63
2017-05-09    0.79    0.47
2017-05-10    0.31    0.89
2017-05-11   -0.38    0.12
2017-05-12   -0.87    0.36


### Exercise: Select using Months

Using a slice of YYYY-MM, select the returns for momentum portfolio
5 in the final 6 months of 2016 as a `Series`.

In [23]:
momentum.loc["2016-07":"2016-12", "mom_05"]

# or
momentum.loc["2016-07":"2016-12"].mom_05

date
2016-07-01    0.24
2016-07-05   -1.82
2016-07-06    0.57
2016-07-07    0.40
2016-07-08    1.90
              ... 
2016-12-23   -0.02
2016-12-27    0.26
2016-12-28   -0.65
2016-12-29    0.06
2016-12-30   -0.68
Name: mom_05, Length: 127, dtype: float64

### Exercise: Ensure DataFrame

Repeat the previous problem but ensure the selection is a DataFrame.

In [24]:
momentum.loc["2016-07":"2016-12", ["mom_05"]]

            mom_05
date
2016-07-01    0.24
2016-07-05   -1.82
2016-07-06    0.57
2016-07-07    0.40
2016-07-08    1.90
...            ...
2016-12-23   -0.02
2016-12-27    0.26
2016-12-28   -0.65
2016-12-29    0.06
