# Using DataFrames

This lesson introduces:

* Computing returns (percentage change)
* Basic mathematical operations on DataFrames

This first cell loads the data used in this lesson.

In [1]:
# Setup: Load prices
import pandas as pd

prices = pd.read_hdf("data/dataframes.h5", "prices")
sep_04 = pd.read_hdf("data/dataframes.h5", "sep_04")
goog = pd.read_hdf("data/dataframes.h5", "goog")

## Problem: Compute Returns

Compute returns using 

```python
returns = prices.pct_change()
```

which computes the percentage change from one row to the next, $P_{t}/P_{t-1}-1$.

Additionally, extract the returns for each name by selecting its column, for example

```python
spy_returns = returns["SPY"]
```
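
As a rough illustration of what `pct_change` does, the sketch below reproduces it with `shift`. The tickers `X` and `Y` and their prices are made up for this example and are not the lesson's data.

```python
import pandas as pd

# Hypothetical prices for illustration only; not the lesson's data
demo_prices = pd.DataFrame(
    {"X": [100.0, 102.0, 99.96], "Y": [50.0, 49.0, 49.98]},
    index=pd.to_datetime(["2018-09-04", "2018-09-05", "2018-09-06"]),
)

# pct_change compares each row with the row before it
demo_returns = demo_prices.pct_change()

# The same calculation written out with shift: P_t / P_{t-1} - 1
manual_returns = demo_prices / demo_prices.shift(1) - 1

# Compare the two results, ignoring the first row, which is missing in both
diff = (demo_returns - manual_returns).abs().dropna()
same = bool((diff < 1e-12).all().all())

print(demo_returns)
print(same)
```

The first row is missing in both versions because it has no previous row to compare against.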

In [2]:
returns = prices.pct_change()
returns.head()

Unnamed: 0,SPY,AAPL,GOOG
2018-09-04,,,
2018-09-05,-0.002691,-0.006525,-0.008789
2018-09-06,-0.00301,-0.016617,-0.012676
2018-09-07,-0.001943,-0.008068,-0.005643
2018-09-10,0.001739,-0.013421,-0.000163


In [3]:
# The first row is missing because there is no data for Sep 3 to compare against.
# Use .dropna() to remove rows with missing values.
returns = returns.dropna()
returns

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,-0.002691,-0.006525,-0.008789
2018-09-06,-0.00301,-0.016617,-0.012676
2018-09-07,-0.001943,-0.008068,-0.005643
2018-09-10,0.001739,-0.013421,-0.000163
2018-09-11,0.003297,0.025283,0.010922
2018-09-12,0.000242,-0.012419,-0.01235
2018-09-13,0.005914,0.024155,0.010758
2018-09-14,0.000172,-0.011351,-0.002382
2018-09-17,-0.005294,-0.026626,-0.014055
2018-09-18,0.005426,0.001652,0.004472


In [4]:
spy_returns = returns["SPY"]
goog_returns = returns.GOOG
aapl_returns = returns["AAPL"]

## Problem: Compute Log Returns

Compute log returns using

```python
import numpy as np

log_returns = np.log(prices).diff()
```

which computes the first difference of the natural log of the prices. Mathematically this is
$r_{t}=\ln\left(P_{t}\right)-\ln\left(P_{t-1}\right)=\ln\left(\frac{P_{t}}{P_{t-1}}\right)\approx\frac{P_{t}}{P_{t-1}}-1$,
where the approximation is accurate when the return is small.
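
To get a feel for how close the approximation is, the following sketch compares simple and log returns on a short, made-up price series (the values are assumptions chosen for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical prices for illustration only
demo_prices = pd.Series([100.0, 101.0, 99.0, 110.0])

simple = demo_prices.pct_change().dropna()
log_ret = np.log(demo_prices).diff().dropna()

# Exact relationship: log return = ln(1 + simple return)
exact = bool(np.allclose(log_ret, np.log1p(simple)))

# The gap between the two measures grows with the size of the move
gap = (simple - log_ret).abs()

print(pd.DataFrame({"simple": simple, "log": log_ret, "gap": gap}))
print(exact)
```

The 1% move produces a gap that is much smaller than the gap for the roughly 11% move in the last row, which is why the two measures are nearly interchangeable for daily data.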

In [5]:
import numpy as np

log_returns = np.log(prices).diff()
log_returns

Unnamed: 0,SPY,AAPL,GOOG
2018-09-04,,,
2018-09-05,-0.002695,-0.006546,-0.008827
2018-09-06,-0.003015,-0.016757,-0.012757
2018-09-07,-0.001945,-0.008101,-0.005659
2018-09-10,0.001737,-0.013512,-0.000163
2018-09-11,0.003292,0.024969,0.010863
2018-09-12,0.000242,-0.012497,-0.012427
2018-09-13,0.005897,0.023868,0.010701
2018-09-14,0.000172,-0.011416,-0.002385
2018-09-17,-0.005308,-0.026987,-0.014155


## Basic Mathematical Operations

|  Operation            | Symbol | Precedence |
|:----------------------|:------:|:----------:|
| Parentheses           | ()     | 4          |
| Exponentiation        | **     | 3          |
| Multiplication        | *      | 2          | 
| Division              | /      | 2          |
| Floor division        | //     | 2          |
| Modulus               | %      | 2          | 
| Matrix multiplication | @      | 2          |
| Addition              | +      | 1          |
| Subtraction           | -      | 1          |

**Note**: Higher precedence operators are evaluated first. Ties are
evaluated left to right, except for exponentiation, which is evaluated
right to left.
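
A few scalar expressions make the ordering concrete. The variable names below are only labels for this sketch.

```python
mul_before_add = 2 + 3 * 4    # multiplication first: 2 + 12
parens_first = (2 + 3) * 4    # parentheses first: 5 * 4
left_to_right = 10 - 4 - 3    # equal precedence: (10 - 4) - 3
floor_then_mul = 7 // 2 * 2   # equal precedence: (7 // 2) * 2
power_chain = 2 ** 3 ** 2     # exponentiation groups right to left: 2 ** (3 ** 2)
neg_power = -2 ** 2           # ** binds tighter than the leading minus: -(2 ** 2)

print(mul_before_add, parens_first, left_to_right)
print(floor_then_mul, power_chain, neg_power)
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.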


## Problem: Scalar Operations
1. Add 1 to all returns.
2. Square the returns.
3. Multiply the price of Google by 2.
4. Extract the fractional part of the returns using floor division and modulus.
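
Floor division and modulus can be surprising for negative numbers, so here is a small sketch using made-up values that are exact in binary. Floor division rounds toward minus infinity, and the modulus takes the sign of the divisor, so a negative value `r` between -1 and 0 has fractional part `1 + r`.

```python
import pandas as pd

# Hypothetical values for illustration only
demo = pd.Series([-0.25, 0.5, -1.75])

whole = demo // 1         # floor: -1.0, 0.0, -2.0
frac_mod = demo % 1       # modulus: 0.75, 0.5, 0.25
frac_floor = demo - whole # same fractional part, built from floor division

# The two parts always add back up to the original value
rebuilt = whole + frac_mod

print(pd.DataFrame({"value": demo, "whole": whole, "frac": frac_mod}))
```

Both approaches give the same fractional part, which is why the two solution cells in this problem produce identical tables.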

In [6]:
1 + returns

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,0.997309,0.993475,0.991211
2018-09-06,0.99699,0.983383,0.987324
2018-09-07,0.998057,0.991932,0.994357
2018-09-10,1.001739,0.986579,0.999837
2018-09-11,1.003297,1.025283,1.010922
2018-09-12,1.000242,0.987581,0.98765
2018-09-13,1.005914,1.024155,1.010758
2018-09-14,1.000172,0.988649,0.997618
2018-09-17,0.994706,0.973374,0.985945
2018-09-18,1.005426,1.001652,1.004472


In [7]:
returns**2

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,7.243734e-06,4.3e-05,7.724016e-05
2018-09-06,9.06051e-06,0.000276,0.0001606848
2018-09-07,3.776667e-06,6.5e-05,3.183925e-05
2018-09-10,3.022472e-06,0.00018,2.660615e-08
2018-09-11,1.087328e-05,0.000639,0.0001192864
2018-09-12,5.864758e-08,0.000154,0.0001525142
2018-09-13,3.49813e-05,0.000583,0.0001157416
2018-09-14,2.955709e-08,0.000129,5.675399e-06
2018-09-17,2.802939e-05,0.000709,0.0001975452
2018-09-18,2.944302e-05,3e-06,1.99999e-05


In [8]:
2 * goog

2018-09-04    2394.00
2018-09-05    2372.96
2018-09-06    2342.88
2018-09-07    2329.66
2018-09-10    2329.28
2018-09-11    2354.72
2018-09-12    2325.64
2018-09-13    2350.66
2018-09-14    2345.06
2018-09-17    2312.10
2018-09-18    2322.44
2018-09-19    2317.56
Name: GOOG, dtype: float64

In [9]:
returns % 1

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,0.997309,0.993475,0.991211
2018-09-06,0.99699,0.983383,0.987324
2018-09-07,0.998057,0.991932,0.994357
2018-09-10,0.001739,0.986579,0.999837
2018-09-11,0.003297,0.025283,0.010922
2018-09-12,0.000242,0.987581,0.98765
2018-09-13,0.005914,0.024155,0.010758
2018-09-14,0.000172,0.988649,0.997618
2018-09-17,0.994706,0.973374,0.985945
2018-09-18,0.005426,0.001652,0.004472


In [10]:
returns - returns // 1

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,0.997309,0.993475,0.991211
2018-09-06,0.99699,0.983383,0.987324
2018-09-07,0.998057,0.991932,0.994357
2018-09-10,0.001739,0.986579,0.999837
2018-09-11,0.003297,0.025283,0.010922
2018-09-12,0.000242,0.987581,0.98765
2018-09-13,0.005914,0.024155,0.010758
2018-09-14,0.000172,0.988649,0.997618
2018-09-17,0.994706,0.973374,0.985945
2018-09-18,0.005426,0.001652,0.004472


## Problem: Addition of Series
Add the returns on SPY to those of AAPL.


In [11]:
spy_returns + aapl_returns

2018-09-05   -0.009216
2018-09-06   -0.019628
2018-09-07   -0.010011
2018-09-10   -0.011682
2018-09-11    0.028580
2018-09-12   -0.012177
2018-09-13    0.030070
2018-09-14   -0.011179
2018-09-17   -0.031920
2018-09-18    0.007078
2018-09-19   -0.005510
dtype: float64

## Problem: Combining methods and mathematical operations
Using only basic mathematical operations, compute the
correlation between the returns on AAPL and SPY.

In [12]:
a = aapl_returns - aapl_returns.mean()
s = spy_returns - spy_returns.mean()

# Get the number of observations
nobs = len(a)
aapl_var = a.dot(a) / (nobs - 1)
spy_var = s.dot(s) / (nobs - 1)
aapl_spy_cov = a.dot(s) / (nobs - 1)

corr = aapl_spy_cov / np.sqrt(aapl_var * spy_var)
print(f"The correlation is {corr}")

The correlation is 0.7738944403258624
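
One way to check a hand-rolled calculation like this is to compare it with the built-in `Series.corr` and `Series.cov` methods. The sketch below does so on two made-up return series; the names `x` and `y` and their values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical return series for illustration only
x = pd.Series([0.01, -0.02, 0.015, 0.003, -0.007])
y = pd.Series([0.004, -0.011, 0.009, -0.001, -0.002])

# Manual calculation, following the same steps as the cell above
xd = x - x.mean()
yd = y - y.mean()
nobs = len(x)
x_var = xd.dot(xd) / (nobs - 1)
y_var = yd.dot(yd) / (nobs - 1)
xy_cov = xd.dot(yd) / (nobs - 1)
manual_corr = xy_cov / np.sqrt(x_var * y_var)

# Built-in Pearson correlation and sample covariance
builtin_corr = x.corr(y)
builtin_cov = x.cov(y)

print(manual_corr, builtin_corr)
```

The `(nobs - 1)` terms cancel in the correlation, so dividing by `nobs` instead would give the same correlation but a different covariance.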


## Problem: Addition of DataFrames
Construct a `DataFrame` that contains only the SPY column from `returns`
and add it to the `returns` `DataFrame`.

In [13]:
spy_df = pd.DataFrame(returns.SPY)

returns + spy_df

Unnamed: 0,AAPL,GOOG,SPY
2018-09-05,,,-0.005383
2018-09-06,,,-0.00602
2018-09-07,,,-0.003887
2018-09-10,,,0.003477
2018-09-11,,,0.006595
2018-09-12,,,0.000484
2018-09-13,,,0.011829
2018-09-14,,,0.000344
2018-09-17,,,-0.010589
2018-09-18,,,0.010852
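
The missing values arise because adding two `DataFrame`s aligns on both the row index and the column labels, so columns present in only one operand become `NaN`. If the intent is instead to add one column to every column, one option is `DataFrame.add` with `axis=0`. The sketch below uses made-up data with hypothetical columns `X` and `Y`.

```python
import numpy as np
import pandas as pd

# Hypothetical returns for illustration only
demo = pd.DataFrame(
    {"X": [0.01, -0.02], "Y": [0.03, 0.04]},
    index=pd.to_datetime(["2018-09-05", "2018-09-06"]),
)
only_x = demo[["X"]]  # one-column DataFrame

# Label alignment: Y has no partner in only_x, so it becomes NaN
aligned = demo + only_x

# Add the X Series to every column, matching on the row index
broadcast = demo.add(demo["X"], axis=0)

print(aligned)
print(broadcast)
```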


## Problem: Non-conformable math

Add the prices in `sep_04` to the prices of `goog`. What happens? 

In [14]:
sep_04 + goog

2018-09-04 00:00:00   NaN
2018-09-05 00:00:00   NaN
2018-09-06 00:00:00   NaN
2018-09-07 00:00:00   NaN
2018-09-10 00:00:00   NaN
2018-09-11 00:00:00   NaN
2018-09-12 00:00:00   NaN
2018-09-13 00:00:00   NaN
2018-09-14 00:00:00   NaN
2018-09-17 00:00:00   NaN
2018-09-18 00:00:00   NaN
2018-09-19 00:00:00   NaN
AAPL                  NaN
GOOG                  NaN
SPY                   NaN
dtype: float64
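
The index of the result suggests that `sep_04` is labelled by ticker while `goog` is labelled by date. Since the two share no labels, pandas returns the union of both indices with every value missing. The sketch below reproduces the effect with made-up stand-ins; as a simplifying assumption, the dates are plain strings rather than timestamps.

```python
import pandas as pd

# Hypothetical stand-ins: one Series labelled by ticker, one by date string
by_ticker = pd.Series([10.0, 20.0], index=["X", "Y"])
by_date = pd.Series([1.0, 2.0], index=["2018-09-04", "2018-09-05"])

# No labels in common, so every entry of the union is NaN
total = by_ticker + by_date

# Selecting a single scalar first avoids alignment altogether
shifted = by_date + by_ticker["X"]

print(total)
print(shifted)
```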

## Problem: Constructing portfolio returns
Set up a 3-element array of portfolio weights 

$$w=\left(\frac{1}{3},\,\frac{1}{3}\,,\frac{1}{3}\right)$$

and compute the return of a portfolio with weight $\frac{1}{3}$ in each security.


In [15]:
import numpy as np

# Make a 3 by 1 array
w = np.array([[1 / 3, 1 / 3, 1 / 3]]).T

port_ret = returns @ w
port_ret

Unnamed: 0,0
2018-09-05,-0.006002
2018-09-06,-0.010768
2018-09-07,-0.005218
2018-09-10,-0.003948
2018-09-11,0.013167
2018-09-12,-0.008176
2018-09-13,0.013609
2018-09-14,-0.00452
2018-09-17,-0.015325
2018-09-18,0.00385
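
Because the weights above form a 3 by 1 array, the result is a one-column `DataFrame`. As a possible alternative, a 1-dimensional weight array returns a `Series`, and for equal weights the result should match a row-wise mean. The sketch below uses made-up returns with hypothetical columns `X`, `Y` and `Z`.

```python
import numpy as np
import pandas as pd

# Hypothetical returns for illustration only
demo_returns = pd.DataFrame(
    {"X": [0.03, -0.06], "Y": [0.00, 0.03], "Z": [0.03, 0.00]},
    index=pd.to_datetime(["2018-09-05", "2018-09-06"]),
)

# 1-D weights with shape (3,)
w = np.array([1 / 3, 1 / 3, 1 / 3])

# Matrix product with a 1-D array gives a Series indexed by date
port_ret = demo_returns @ w

# With equal weights this is the average return across the columns
equal_weight = demo_returns.mean(axis=1)

print(port_ret)
```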


## Exercises

### Exercise: Combine math with function

Add 1 to the output of `np.arange` to produce the sequence 1, 2, ..., 10.

In [16]:
import numpy as np

1 + np.arange(10)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

### Exercise: Understand pandas math

Use the `Series` and `DataFrame` below to compute the sums 

* `a+b`
* `a+c`
* `b+c`
* `a+b+c`

to understand how pandas aligns labels and where it produces missing values.

In [17]:
# Setup: Data for exercise
import numpy as np
import pandas as pd

rs = np.random.RandomState(19991231)

idx = ["A", "a", "B", 3]
columns = ["A", 1, "B", 3]
a = pd.Series([1, 2, 3, 4], index=idx)
b = pd.Series([10, 9, 8, 7], index=columns)
values = rs.randint(1, 11, size=(4, 4))
c = pd.DataFrame(values, columns=columns, index=idx)

In [18]:
a + b

1     NaN
3    11.0
A    11.0
B    11.0
a     NaN
dtype: float64

In [19]:
a + c

Unnamed: 0,1,3,A,B,a
A,,14.0,6.0,13.0,
a,,13.0,6.0,9.0,
B,,13.0,11.0,8.0,
3,,9.0,3.0,7.0,


In [20]:
b + c

Unnamed: 0,A,1,B,3
A,15,17,18,17
a,15,18,14,16
B,20,17,13,16
3,12,11,12,12


In [21]:
a + b + c

Unnamed: 0,1,3,A,B,a
A,,21.0,16.0,21.0,
a,,20.0,16.0,17.0,
B,,20.0,21.0,16.0,
3,,16.0,13.0,15.0,
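
The pattern in these results can be summarized as follows: two `Series` align on their indices, while a `Series` added to a `DataFrame` has its index matched against the column labels. Labels found in only one operand produce `NaN`. The sketch below shows this on smaller made-up objects (`s1`, `s2` and `frame` are names chosen for this example), along with `fill_value` as one way to treat a missing label as zero.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["A", "B", "C"])
s2 = pd.Series([10, 20], index=["B", "C"])

# "A" appears only in s1, so its sum is NaN
plain = s1 + s2

# fill_value substitutes 0 for the label missing from s2
filled = s1.add(s2, fill_value=0)

# The index of s2 is matched to the columns of the DataFrame
frame = pd.DataFrame({"B": [1, 2], "C": [3, 4]}, index=["r1", "r2"])
by_column = frame + s2

print(plain)
print(filled)
print(by_column)
```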


### Exercise: Math with duplicates

Add the Series `d` to `a` to see what happens with duplicate index labels.

In [22]:
# Setup: Data for exercise

d = pd.Series([10, 101], index=["A", "A"])

In [23]:
a + d

3      NaN
A     11.0
A    102.0
B      NaN
a      NaN
dtype: float64
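
A duplicated label is matched against every occurrence, so the result has one row per duplicate. If that is not what is wanted, one possible approach is to collapse the duplicates first, for example by summing them with `groupby(level=0)`. The sketch below uses made-up Series named `left` and `dup`.

```python
import pandas as pd

left = pd.Series([1, 2], index=["A", "B"])
dup = pd.Series([10, 100], index=["A", "A"])

# "A" in left is paired with both "A" entries in dup; "B" has no partner
total = left + dup

# Collapse the duplicate labels before adding
collapsed = dup.groupby(level=0).sum()
total_collapsed = left.add(collapsed, fill_value=0)

print(total)
print(total_collapsed)
```

Whether to sum, average, or keep only one of the duplicates depends on what the duplicated observations represent.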