**Fin 585**  
**Diether**  
**Intro to Portfolios** 


**1 Overview**

+ This notebook introduces the concept of **portfolios.**

+ It also introduces **portfolio construction** using Python/Pandas.

+ It covers programming concepts for basic portfolio formation and computing portfolio returns.

+ Portfolio formation and computing portfolio returns rely heavily on the groupby programming construct.<br><br>


**2 Portfolios: Conceptual Overview**

+ A portfolio is a collection of assets (stocks, bonds, etc.).

+ Portfolios aren't an artificial construct. $\leftarrow$ if you own any financial assets, you have a portfolio.

+ These assets can be primitive securities like stocks or bonds.

+ These assets can also be other portfolios.

**2.1 A Portfolio's Defined by Two Parameters**

+ Two parameters $\rightarrow$ 

  1. The assets in the portfolio.
  
  2. The weight on each asset $\leftarrow$ weight = percent of overall investment.
  
+ Basic measure of portfolio performance $\rightarrow$ returns (percent change in the value of the portfolio over a given unit of time).<br><br>


**2.2 Example Portfolio**

+ You invest 30% in Krispy Kreme's stock and 70% in Google's stock.

+ Weights: $w_{g} = 0.7$ and $w_{k} = 0.3$.

+ For a standard portfolio (called a unit cost portfolio), the weights must sum to 1.

+ The one period return (period $t$) for any asset ($i$) is the following (d = dividend and P = Price):
$$
r_{it} = \frac{d_{it} + P_{it} - P_{i,t-1}}{P_{i,t-1}} 
$$

+ If the asset is Google:
$$
r_{gt} = \frac{P_{gt} - P_{g,t-1} + d_{gt}}{P_{g,t-1}}= \frac{P_{gt} + d_{gt}}{P_{g,t-1}} - 1
$$

+ It's just the percentage change in value of the asset including cash payments or dividends (payouts to investors) during the period.

+ Because a portfolio is defined by its assets and their weights, the return on our two-asset portfolio (call it portfolio $p$) is the following:
\begin{align*}
r_p  &= w_g r_g + (1-w_g)r_k \\
r_{p} &= 0.7r_{g} + 0.3r_{k}
\end{align*}<br><br>
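
As a quick numerical check of the two-asset formula, here is a minimal sketch. Only the 70/30 split comes from the example above; the dollar amounts and the one-period returns are made up for illustration.

```python
# Hypothetical dollar amounts (only the 70/30 split comes from the example)
invested = {"google": 7_000.0, "krispy_kreme": 3_000.0}
total = sum(invested.values())

# Weight = share of the overall investment
weights = {name: amount / total for name, amount in invested.items()}

# Hypothetical one-period returns (made up for illustration)
returns = {"google": 0.02, "krispy_kreme": -0.01}

# Portfolio return = weighted sum of the asset returns
r_p = sum(weights[name] * returns[name] for name in weights)

print(weights)
print(f"portfolio return: {r_p:.4f}")
```

With these made-up inputs the weights come out to 0.7 and 0.3, and the portfolio return is just $0.7 \times 0.02 + 0.3 \times (-0.01)$.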


**2.3 N-Asset Portfolio**

+ In general we can write the return on a portfolio with N assets as the following:

$$
r_{p} = \sum_{i=1}^{N} \omega_{i}r_{i}, \quad \text{where} \quad \sum_{i=1}^{N} \omega_{i} = 1  
$$

+ $r_i$ refers to the return on asset $i$.

+ $\omega_i$ refers to the weight on asset $i$ in the portfolio.<br><br>
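
The N-asset formula is a dot product between the weight vector and the return vector. The sketch below is one possible way to write it; the helper name `portfolio_return` and all of the numbers are hypothetical and are not part of the course code.

```python
import numpy as np

def portfolio_return(weights, returns):
    """Return sum_i w_i * r_i for a unit-cost portfolio (weights sum to 1)."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(returns, dtype=float)
    if w.shape != r.shape:
        raise ValueError("weights and returns must have the same length")
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    return float(w @ r)

# Hypothetical four-asset portfolio (numbers made up for illustration)
w = [0.4, 0.3, 0.2, 0.1]
r = [0.05, -0.02, 0.01, 0.03]
rp = portfolio_return(w, r)
print(rp)
```

The check on the weights mirrors the unit-cost condition $\sum_i \omega_i = 1$; dropping that check would allow other kinds of portfolios, which this section does not cover.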


**3 Portfolio Construction Framework**

1. Data preparation.

2. Creation of the portfolio formation variable.

3. Binning the stock return data based on the formation variable.

4. Portfolio creation.

5. Estimating historical performance of the portfolios or testing economic models using portfolios.<br><br>


**3.1 Today's Focus for Our Framework**

+ Today, we introduce steps 2-4.

+ But it's certainly not the last time we will discuss these steps.


**Goal of Step 1 (Data Preparation)**

+ I've already done step 1.

+ Our goal for step 1 is to get the data in panel form.

+ That form $\rightarrow$ panel data with two dimensions: date and entity (e.g., different stocks).

+ For example, our data today are in panel form: monthly stock data.

+ Permno/caldt defines an observation.

+ Returns and prices of the stocks are going to be our variables of interest.

```
    permno      caldt ticker     prc       ret
0    10026 2020-09-30   JJSF  130.39 -0.036668
1    10026 2020-10-30   JJSF  135.57  0.039727
2    10026 2020-11-30   JJSF  145.39  0.072435
3    10026 2020-12-31   JJSF  155.37  0.072598
4    10028 2020-09-30    ELA    4.29  0.105670
5    10028 2020-10-30    ELA    4.04 -0.058275
6    10028 2020-11-30    ELA    4.62  0.143560
7    10028 2020-12-31    ELA    5.20  0.125540
8    10032 2020-09-30   PLXS   70.63 -0.071513
9    10032 2020-10-30   PLXS   69.54 -0.015432
10   10032 2020-11-30   PLXS   74.71  0.074346
11   10032 2020-12-31   PLXS   78.21  0.046848
```

+ We often want to group and then transform data by stock ID or date.

+ Portfolio construction typically involves both.<br><br>
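
To make the two kinds of grouping concrete, the sketch below types in the twelve rows shown above and groups them both ways. Grouping by `permno` rebuilds a price-only return from `prc`, and grouping by `caldt` averages `ret` across stocks. The column names `prc_ret` and `gap` are made up for this illustration.

```python
import pandas as pd

# The twelve rows shown above, typed in by hand
panel = pd.DataFrame({
    "permno": [10026] * 4 + [10028] * 4 + [10032] * 4,
    "caldt": pd.to_datetime(
        ["2020-09-30", "2020-10-30", "2020-11-30", "2020-12-31"] * 3
    ),
    "ticker": ["JJSF"] * 4 + ["ELA"] * 4 + ["PLXS"] * 4,
    "prc": [130.39, 135.57, 145.39, 155.37,
            4.29, 4.04, 4.62, 5.20,
            70.63, 69.54, 74.71, 78.21],
    "ret": [-0.036668, 0.039727, 0.072435, 0.072598,
            0.105670, -0.058275, 0.143560, 0.125540,
            -0.071513, -0.015432, 0.074346, 0.046848],
})

# Group by stock: price-only return within each permno
# (each stock's first month has no prior price, so it is NaN)
prev_prc = panel.groupby("permno")["prc"].shift(1)
panel["prc_ret"] = panel["prc"] / prev_prc - 1

# Gap between the reported return and the price-only return
panel["gap"] = panel["ret"] - panel["prc_ret"]

# Group by date: average return across the stocks in each month
avg_by_month = panel.groupby("caldt")["ret"].mean()

print(panel[["ticker", "caldt", "ret", "prc_ret", "gap"]].round(6))
print(avg_by_month.round(6))
```

In most rows the gap is zero up to rounding, so `ret` matches the price change. The one visible exception is JJSF in December 2020, where the gap times the prior price is roughly 0.575 per share. That is what the $d_t$ term in the return formula would produce, although the table alone cannot confirm the source of the gap.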


**3.2 Data for our Grouping Example**

+ The data are monthly stock prices and returns for all publicly traded stocks in the U.S. from 2020 to 2023.

+ The data are drawn from the standard academic source: the Center for Research in Security Prices (CRSP).  

+ The basic unit of observation is the stock id-month. 

+ You can download the data directly using the following link: [the data](https://diether.org/markets/02-mstk.csv).

+ Data variables:

|Variable | Description                                       |
|---------|---------------------------------------------------|
|permno   | stock identifier                                  |
|caldt    | calendar date                                     |
|ticker   | ticker symbol                                     |
|prc      | month end price                                   |
|ret      | monthly return                                    |
|vol      | monthly shares traded (in 1,000s)                 |
|shr      | shares outstanding (in 1,000s)                    |   

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("https://diether.org/prephd/02-mstk.csv",parse_dates=['caldt'])
df

Unnamed: 0,permno,caldt,ticker,prc,ret,vol,shr
0,10026,2020-01-31,JJSF,165.84,-0.100020,22433.0,18919.0
1,10026,2020-02-28,JJSF,160.82,-0.030270,18648.0,18919.0
2,10026,2020-03-31,JJSF,121.00,-0.244030,39302.0,18888.0
3,10026,2020-04-30,JJSF,127.03,0.049835,35670.0,18888.0
4,10026,2020-05-29,JJSF,128.63,0.012596,27534.0,18888.0
...,...,...,...,...,...,...,...
195670,93436,2023-08-31,TSLA,258.08,-0.034962,25029000.0,3174000.0
195671,93436,2023-09-29,TSLA,250.22,-0.030456,24395000.0,3179000.0
195672,93436,2023-10-31,TSLA,200.84,-0.197350,25906000.0,3178900.0
195673,93436,2023-11-30,TSLA,240.08,0.195380,26396000.0,3178900.0


<br>

**4 Our First Portfolio: Equal-Weight Portfolio of All Stocks**

+ It's actually an easy portfolio to construct.

+ Step 1: done, in panel form.

+ Step 2: portfolio formation variable $\leftarrow$ all stocks.

+ Step 3: bin the data $\leftarrow$ no binning, want all stocks every month.

+ Step 4: portfolio creation and returns $\leftarrow$ some work here.


**Equal-Weight Portfolio of All Stocks**

+ Equal-weight portfolios are very common.<br>

+ They are relatively easy to program.<br>

+ Each stock's weight in the portfolio is $1/N$.<br>

+ This implies we rebalance the weights every month $\leftarrow$ buy/sell at the end of every month to equalize the weights.


**Step 4: Need to Implement the Formula**

+ Every month, the return on the portfolio is the following (where $r_i$ is the return on the ith asset in the portfolio):
\begin{align*}
r_p &= \frac{1}{n}r_1 + \frac{1}{n}r_2 + \frac{1}{n}r_3 + \cdots  + \frac{1}{n}r_n \\
    &= \frac{1}{n} \bigl(r_1 + r_2 + r_3 + \cdots  + r_n \bigr) \\
    &= \frac{1}{n} \sum_{i=1}^{n} r_i 
\end{align*}

+ Note, the preceding is just an average across all stocks in a given month.

+ That gives us a shortcut.

+ Computationally all portfolio returns can be thought of as weighted sums or weighted means.<br><br>
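
The sketch below illustrates the weighted-sum view with made-up numbers. With weights of $1/n$ the weighted sum collapses to the plain mean, and any other set of weights that sums to 1 goes through the same formula. Both the returns and the second weight vector are hypothetical.

```python
import numpy as np

# Hypothetical returns for five stocks in one month (made up for illustration)
r = np.array([0.04, -0.02, 0.01, 0.07, -0.05])
n = len(r)

# Equal weights: every stock gets 1/n
w_equal = np.full(n, 1.0 / n)
ew_return = np.sum(w_equal * r)          # weighted sum with weights 1/n

# Any other unit-cost weights use the same weighted-sum formula
w_other = np.array([0.5, 0.2, 0.1, 0.1, 0.1])   # hypothetical, sums to 1
other_return = np.sum(w_other * r)

print(ew_return, r.mean())                            # same number twice
print(other_return, np.average(r, weights=w_other))   # same number twice
```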


**4.1 Implementing Step 4 in Python**

+ So conceptually to form this portfolio we want to do the following:

  1. group the observations by calendar month<br>
  
  2. loop through each of the months computing the average across the stocks (equivalent to the equal-weight portfolio return)<br>
  
  3. save those portfolio returns into a new dataframe.<br>

+ Python/Pandas is really good at this $\leftarrow$ **just a simple groupby.** <br><br>
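
To see that the three conceptual steps and the groupby are the same computation, here is a small sketch on a made-up panel (three stocks, two months). The explicit loop is only for illustration; in practice the one-line groupby is what you would use.

```python
import pandas as pd

# Tiny hypothetical panel: three stocks, two months (returns made up)
toy = pd.DataFrame({
    "permno": [1, 1, 2, 2, 3, 3],
    "caldt": pd.to_datetime(["2020-01-31", "2020-02-28"] * 3),
    "ret": [0.02, -0.01, 0.04, 0.03, -0.03, 0.01],
})

# The three steps written out by hand
rows = []
for month, grp in toy.groupby("caldt"):                      # 1. group by month
    rows.append({"caldt": month, "ret": grp["ret"].mean()})  # 2. average across stocks
loop_port = pd.DataFrame(rows).set_index("caldt")["ret"]     # 3. save the results

# The same thing as a single groupby
gb_port = toy.groupby("caldt")["ret"].mean()

print(loop_port)
print(gb_port)
```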

In [4]:
df.groupby('caldt')['ret']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7a0e874a33d0>

In [5]:
df.groupby('caldt')['ret'].mean()

caldt
2020-01-31   -0.011827
2020-02-28   -0.072003
2020-03-31   -0.223604
2020-04-30    0.192569
2020-05-29    0.081158
2020-06-30    0.071063
2020-07-31    0.040600
2020-08-31    0.048534
2020-09-30   -0.026077
2020-10-30    0.008787
2020-11-30    0.205235
2020-12-31    0.096995
2021-01-29    0.109630
2021-02-26    0.087031
2021-03-31    0.012874
2021-04-30    0.012305
2021-05-28    0.010925
2021-06-30    0.030081
2021-07-30   -0.042014
2021-08-31    0.023724
2021-09-30   -0.028491
2021-10-29    0.023253
2021-11-30   -0.053557
2021-12-31   -0.008938
2022-01-31   -0.089722
2022-02-28   -0.006617
2022-03-31    0.017577
2022-04-29   -0.107670
2022-05-31   -0.024743
2022-06-30   -0.069112
2022-07-29    0.079689
2022-08-31   -0.002836
2022-09-30   -0.108023
2022-10-31    0.064331
2022-11-30    0.001524
2022-12-30   -0.060592
2023-01-31    0.153287
2023-02-28   -0.037292
2023-03-31   -0.063745
2023-04-28   -0.021450
2023-05-31   -0.000910
2023-06-30    0.059115
2023-07-31    0.050704
2023-

In [6]:
port = df.groupby('caldt')['ret'].mean()*100 # express returns in percent
port.describe().round(3) # a return of -100 (percent) means you lost all your money

count    48.000
mean      0.844
std       8.090
min     -22.360
25%      -4.490
50%       0.516
75%       6.042
max      20.523
Name: ret, dtype: float64

<br>

**5 Closer Look at groupby/Apply**

+ Let's do a simple groupby using a function that just prints out each group.

+ Have to write a simple function.

In [7]:
def out(x):
    print(x,'\n')

(df.groupby('caldt')[['permno','caldt','ret']].apply(out))

        permno      caldt       ret
0        10026 2020-01-31 -0.100020
48       10028 2020-01-31  0.607410
96       10032 2020-01-31 -0.075643
144      10044 2020-01-31 -0.098592
192      10051 2020-01-31 -0.115180
...        ...        ...       ...
195435   93423 2020-01-31 -0.154730
195483   93426 2020-01-31  0.015882
195531   93429 2020-01-31  0.026833
195579   93434 2020-01-31  0.023810
195627   93436 2020-01-31  0.555160

[3591 rows x 3 columns] 

        permno      caldt       ret
1        10026 2020-02-28 -0.030270
49       10028 2020-02-28  0.225810
97       10032 2020-02-28 -0.067070
145      10044 2020-02-28 -0.064904
193      10051 2020-02-28 -0.055669
...        ...        ...       ...
195436   93423 2020-02-28 -0.337000
195484   93426 2020-02-28 -0.204690
195532   93429 2020-02-28 -0.071904
195580   93434 2020-02-28  0.372090
195628   93436 2020-02-28  0.026777

[3593 rows x 3 columns] 

        permno      caldt       ret
2        10026 2020-03-31 -0.244030
50       1

In [8]:
def avg(x):
    return x.mean()

df.groupby('caldt')['ret'].apply(avg)

caldt
2020-01-31   -0.011827
2020-02-28   -0.072003
2020-03-31   -0.223604
2020-04-30    0.192569
2020-05-29    0.081158
2020-06-30    0.071063
2020-07-31    0.040600
2020-08-31    0.048534
2020-09-30   -0.026077
2020-10-30    0.008787
2020-11-30    0.205235
2020-12-31    0.096995
2021-01-29    0.109630
2021-02-26    0.087031
2021-03-31    0.012874
2021-04-30    0.012305
2021-05-28    0.010925
2021-06-30    0.030081
2021-07-30   -0.042014
2021-08-31    0.023724
2021-09-30   -0.028491
2021-10-29    0.023253
2021-11-30   -0.053557
2021-12-31   -0.008938
2022-01-31   -0.089722
2022-02-28   -0.006617
2022-03-31    0.017577
2022-04-29   -0.107670
2022-05-31   -0.024743
2022-06-30   -0.069112
2022-07-29    0.079689
2022-08-31   -0.002836
2022-09-30   -0.108023
2022-10-31    0.064331
2022-11-30    0.001524
2022-12-30   -0.060592
2023-01-31    0.153287
2023-02-28   -0.037292
2023-03-31   -0.063745
2023-04-28   -0.021450
2023-05-31   -0.000910
2023-06-30    0.059115
2023-07-31    0.050704
2023-

In [9]:
df.groupby('caldt')['ret'].agg(avg)

caldt
2020-01-31   -0.011827
2020-02-28   -0.072003
2020-03-31   -0.223604
2020-04-30    0.192569
2020-05-29    0.081158
2020-06-30    0.071063
2020-07-31    0.040600
2020-08-31    0.048534
2020-09-30   -0.026077
2020-10-30    0.008787
2020-11-30    0.205235
2020-12-31    0.096995
2021-01-29    0.109630
2021-02-26    0.087031
2021-03-31    0.012874
2021-04-30    0.012305
2021-05-28    0.010925
2021-06-30    0.030081
2021-07-30   -0.042014
2021-08-31    0.023724
2021-09-30   -0.028491
2021-10-29    0.023253
2021-11-30   -0.053557
2021-12-31   -0.008938
2022-01-31   -0.089722
2022-02-28   -0.006617
2022-03-31    0.017577
2022-04-29   -0.107670
2022-05-31   -0.024743
2022-06-30   -0.069112
2022-07-29    0.079689
2022-08-31   -0.002836
2022-09-30   -0.108023
2022-10-31    0.064331
2022-11-30    0.001524
2022-12-30   -0.060592
2023-01-31    0.153287
2023-02-28   -0.037292
2023-03-31   -0.063745
2023-04-28   -0.021450
2023-05-31   -0.000910
2023-06-30    0.059115
2023-07-31    0.050704
2023-

<br>

**6 Portfolios Formed on High or Low Lagged Price**

+ Let's form two portfolios:

  1. Portfolio contains stocks with low prices: $P_{lag} \le 5$.
  
  2. Portfolio contains stocks with higher prices: $P_{lag} > 5$.<br><br>


**6.1 Portfolio Formation Framework**

1. Data prep: done, data in panel form.

2. Portfolio formation variable $\leftarrow$ **lagged price**.

3. Bin the data $\leftarrow$ create binning variable that equals 0 if $P_{lag} \le 5$ and 1 if $P_{lag} > 5$.

4. Portfolio creation and returns $\leftarrow$ let's form equal-weight portfolios for each.<br><br>


**6.2 Portfolio Formation Variable: Lagged Price**

+ Key $\rightarrow$ in portfolio construction we can only use information we would have had in real time.

+ Ignoring this can create terrible biases in testing.

+ **Always be careful with this issue.**

+ Returns are measured at time $t$.

+ Asset selection and portfolio construction info has to come from $t-1$ or earlier.

+ Therefore, price must be lagged.

**Pandas: Use Shift**

+ The code in the next cell is wrong. Why?

+ We need to use a groupby with shift. Why?

In [10]:
df['prclag'] = df['prc'].shift(1) # wrong: the lag carries over from one stock (permno) to the next
df.tail(60)

Unnamed: 0,permno,caldt,ticker,prc,ret,vol,shr,prclag
195615,93434,2023-01-31,SANW,1.465,-0.016779,12629.0,42763.0,1.49
195616,93434,2023-02-28,SANW,1.83,0.24915,58240.0,42786.0,1.465
195617,93434,2023-03-31,SANW,1.445,-0.21038,9352.0,42889.0,1.83
195618,93434,2023-04-28,SANW,1.36,-0.058824,7355.0,42889.0,1.445
195619,93434,2023-05-31,SANW,1.06,-0.22059,8888.0,42964.0,1.36
195620,93434,2023-06-30,SANW,1.22,0.15094,6626.0,42979.0,1.06
195621,93434,2023-07-31,SANW,1.21,-0.008197,4757.0,42979.0,1.22
195622,93434,2023-08-31,SANW,0.9401,-0.22306,6950.0,42979.0,1.21
195623,93434,2023-09-29,SANW,1.12,0.19136,9959.0,42979.0,0.9401
195624,93434,2023-10-31,SANW,0.6697,-0.40205,8926.0,43039.0,1.12


In [11]:
df['prclag'] = df.groupby('permno')['prc'].shift(1)
df.tail(60)

Unnamed: 0,permno,caldt,ticker,prc,ret,vol,shr,prclag
195615,93434,2023-01-31,SANW,1.465,-0.016779,12629.0,42763.0,1.49
195616,93434,2023-02-28,SANW,1.83,0.24915,58240.0,42786.0,1.465
195617,93434,2023-03-31,SANW,1.445,-0.21038,9352.0,42889.0,1.83
195618,93434,2023-04-28,SANW,1.36,-0.058824,7355.0,42889.0,1.445
195619,93434,2023-05-31,SANW,1.06,-0.22059,8888.0,42964.0,1.36
195620,93434,2023-06-30,SANW,1.22,0.15094,6626.0,42979.0,1.06
195621,93434,2023-07-31,SANW,1.21,-0.008197,4757.0,42979.0,1.22
195622,93434,2023-08-31,SANW,0.9401,-0.22306,6950.0,42979.0,1.21
195623,93434,2023-09-29,SANW,1.12,0.19136,9959.0,42979.0,0.9401
195624,93434,2023-10-31,SANW,0.6697,-0.40205,8926.0,43039.0,1.12
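
The tail of the data looks the same in both cells, so the difference is easier to see at the boundary between two stocks. The sketch below stacks two stocks using prices from the sample panel shown earlier and compares the two kinds of shift. It assumes the rows are sorted by permno and then date; the column names `lag_plain` and `lag_grouped` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Two stocks stacked on top of each other, sorted by permno then date
# (prices taken from the sample panel shown earlier in this notebook)
toy = pd.DataFrame({
    "permno": [10026, 10026, 10026, 10028, 10028, 10028],
    "caldt": pd.to_datetime(["2020-10-30", "2020-11-30", "2020-12-31"] * 2),
    "prc": [135.57, 145.39, 155.37, 4.04, 4.62, 5.20],
})

# Plain shift: moves the whole column down one row, ignoring stock boundaries
toy["lag_plain"] = toy["prc"].shift(1)

# Grouped shift: lags within each permno, so each stock's first row is NaN
toy["lag_grouped"] = toy.groupby("permno")["prc"].shift(1)

print(toy)
```

In row 3 (the first row for permno 10028) the plain shift reports 155.37, which is the other stock's last price, while the grouped shift correctly reports a missing value.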


<br>**Bin the Data with Cut Based on Lagged Price**

+ `pd.cut` takes breakpoints and bins the data.

+ Specify the breakpoint values in a list: `[0,5,500000]`.

+ This creates two bins: (0,5] and (5,500000].

In [None]:
pd.cut(df['prclag'],[0,5,500000]) # by default each bin includes its right edge: (0, 5] means 0 < x <= 5

0                     NaN
1         (5.0, 500000.0]
2         (5.0, 500000.0]
3         (5.0, 500000.0]
4         (5.0, 500000.0]
               ...       
195670    (5.0, 500000.0]
195671    (5.0, 500000.0]
195672    (5.0, 500000.0]
195673    (5.0, 500000.0]
195674    (5.0, 500000.0]
Name: prclag, Length: 195675, dtype: category
Categories (2, interval[int64, right]): [(0, 5] < (5, 500000]]

In [13]:
pd.cut(df['prclag'],[0,5,500000],labels=False)

0         NaN
1         1.0
2         1.0
3         1.0
4         1.0
         ... 
195670    1.0
195671    1.0
195672    1.0
195673    1.0
195674    1.0
Name: prclag, Length: 195675, dtype: float64
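
The edge cases of `pd.cut` are easier to see on a handful of values. The lagged prices in this sketch are hypothetical and were chosen to hit the boundaries: a missing value, a value below the lowest breakpoint, a value exactly at 5, and a value just above 5.

```python
import numpy as np
import pandas as pd

# Hypothetical lagged prices chosen to hit the edge cases
prclag = pd.Series([np.nan, -3.0, 0.5, 5.0, 5.01, 200.0])

# Same breakpoints as above; labels=False returns integer bin codes
bins = pd.cut(prclag, [0, 5, 500000], labels=False)

print(pd.DataFrame({"prclag": prclag, "bin": bins}))
```

A price of exactly 5.0 lands in bin 0 because the intervals are closed on the right. Missing values and values outside the breakpoints get NaN rather than a bin, so a binned column can have fewer non-missing observations than the variable it was built from.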

In [22]:
df['bins'] = pd.cut(df['prclag'],[0,5,1000000],labels=False)
df

Unnamed: 0,permno,caldt,ticker,prc,ret,vol,shr,prclag,bins
0,10026,2020-01-31,JJSF,165.84,-0.100020,22433.0,18919.0,,
1,10026,2020-02-28,JJSF,160.82,-0.030270,18648.0,18919.0,165.84,1.0
2,10026,2020-03-31,JJSF,121.00,-0.244030,39302.0,18888.0,160.82,1.0
3,10026,2020-04-30,JJSF,127.03,0.049835,35670.0,18888.0,121.00,1.0
4,10026,2020-05-29,JJSF,128.63,0.012596,27534.0,18888.0,127.03,1.0
...,...,...,...,...,...,...,...,...,...
195670,93436,2023-08-31,TSLA,258.08,-0.034962,25029000.0,3174000.0,267.43,1.0
195671,93436,2023-09-29,TSLA,250.22,-0.030456,24395000.0,3179000.0,258.08,1.0
195672,93436,2023-10-31,TSLA,200.84,-0.197350,25906000.0,3178900.0,250.22,1.0
195673,93436,2023-11-30,TSLA,240.08,0.195380,26396000.0,3178900.0,200.84,1.0


In [24]:
df[['prclag', 'bins']].describe()

Unnamed: 0,prclag,bins
count,190560.0,187279.0
mean,151.929625,0.769419
std,6704.125945,0.421206
min,-1010.5,0.0
25%,5.23,1.0
50%,16.485,1.0
75%,46.86,1.0
max,546720.0,1.0


<br>

**6.3 Compute Returns of the Low/High Lagged Price Based Portfolios**

+ Use the same basic code as our equal-weight portfolio of all stocks.

+ But we want to group on date/bin combinations.

+ Pandas does this too $\rightarrow$ **two-way groupby**.

+ For each date/bin combination, compute an equal-weight portfolio return (equivalent to average return across the assets for each bin in a given month).

In [None]:
port = df.groupby(['caldt','bins'])['ret'].mean()*100 
# group by date and bin to compare returns for
# prclag <= 5 (bin 0) versus prclag > 5 (bin 1)
port

caldt       bins
2020-02-28  0.0     -5.147967
            1.0     -7.853632
2020-03-31  0.0    -24.987104
            1.0    -21.696438
2020-04-30  0.0     30.280413
                      ...    
2023-10-31  1.0     -6.233325
2023-11-30  0.0      7.863910
            1.0      9.125371
2023-12-29  0.0     13.684424
            1.0     11.046642
Name: ret, Length: 94, dtype: float64

<br>

**Trick: Unstack**

+ Nobody likes this data arrangement for portfolios.

+ Want to make the dataframe look like a matrix.

+ Use unstack to make bins into columns.

In [17]:
port = df.groupby(['caldt','bins'])['ret'].mean()*100
port = port.unstack(level='bins')
port

bins,0.0,1.0
caldt,Unnamed: 1_level_1,Unnamed: 2_level_1
2020-02-28,-5.147967,-7.853632
2020-03-31,-24.987104,-21.696438
2020-04-30,30.280413,14.982232
2020-05-29,12.751737,6.703513
2020-06-30,18.682843,3.639238
2020-07-31,10.438906,2.319033
2020-08-31,2.163001,5.533756
2020-09-30,-2.399827,-2.658771
2020-10-30,-1.192995,1.440759
2020-11-30,30.26785,18.106438


In [18]:
port.describe().round(3)

bins,0.0,1.0
count,47.0,47.0
mean,1.365,0.827
std,13.668,7.103
min,-24.987,-21.696
25%,-8.509,-3.835
50%,-1.193,0.946
75%,7.3,5.522
max,38.764,18.106


<br>

**BYU Finance Library**

+ We are going to use my finance library's `summary` function.

+ It adds a t-stat (and p-value) for the test that the mean of each column equals zero.

In [25]:
from finance_byu.summarize import summary
summary(port)

bins,0.0,1.0
count,47.0,47.0
mean,1.365445,0.827352
std,13.668227,7.103274
tstat,0.684874,0.798511
pval,0.49686,0.428678
min,-24.987104,-21.696438
25%,-8.508778,-3.835055
50%,-1.192995,0.945579
75%,7.299949,5.52207
max,38.76391,18.106438
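
As a rough cross-check of the `tstat` row, the sketch below recomputes it from the `count`, `mean`, and `std` rows above. It assumes the statistic is the usual one-sample t-statistic, $t = \bar{r} / (s/\sqrt{n})$; that is an assumption about what `summary` does internally, not something taken from the library's documentation.

```python
import numpy as np

# Summary statistics copied from the output above (columns: bins 0.0 and 1.0)
n = 47
mean = np.array([1.365445, 0.827352])
std = np.array([13.668227, 7.103274])

# One-sample t-statistic for the null hypothesis that the mean is zero
tstat = mean / (std / np.sqrt(n))

print(tstat.round(6))
```

The recomputed values agree with the `tstat` row to rounding, which is consistent with the assumption. Neither mean is statistically distinguishable from zero at conventional levels in this 47-month sample.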


In [None]:
summary(port).loc[['count','mean','std','tstat','pval']].round(3)