<a href="https://colab.research.google.com/github/boyerb/Investments/blob/master/Ex07-WRDS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Investments: Theory, Fundamental Analysis, and Data Driven Analytics**, Bates, Boyer, and Fletcher

# Example Chapter 7: The WRDS API
In this example we illustrate how to access CRSP data on equities using the WRDS API. The script uses the function `get_crsp_msf_by_ids` to download monthly data, where `msf` stands for *monthly stock file*. A similar function, `get_crsp_dsf_by_ids`, downloads daily data, where `dsf` stands for *daily stock file*. These functions are part of the custom `simple_finance` **module**, which is available from the course GitHub repo. A module is a single Python file that contains functions, classes, or variables. Packaging code into well-designed functions keeps our Python code clean, spares us less-important coding details, and keeps the focus on core objectives. The complete code for this module can be seen at [`https://github.com/boyerb/Investments/blob/master/functions/simple_finance.py`](https://github.com/boyerb/Investments/blob/master/functions/simple_finance.py).

To run the script below, you first need a WRDS account (username and password) obtained through your institution, with two-factor authentication set up on the WRDS site.

### Imports and Setup
Below we install and import one module and four packages that provide useful functions. The module `simple_finance` and the package `wrds` are not included in Colab by default and require installation.

* We use the `!curl` command to download the custom module `simple_finance.py` from the course GitHub repo.
* We use the `!pip install` command to install the WRDS package.
* The `NumPy`, `Pandas`, and `Matplotlib` libraries are preinstalled in Colab and in most scientific Python environments, so they can be imported directly without any extra steps.

In [None]:
!curl -O https://raw.githubusercontent.com/boyerb/Investments/master/functions/simple_finance.py
!pip -q install wrds

import simple_finance as sf
import wrds
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
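
If you run this notebook outside Colab, it can help to check which packages are present before importing them. The following is a minimal sketch using only the standard library; the list `required` is our own illustration and is not part of `simple_finance`.

```python
import importlib.util

# Packages this notebook relies on (import names, not pip names)
required = ["numpy", "pandas", "matplotlib", "wrds"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any package reported as missing can then be installed with `pip install`.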

### Establish a WRDS Connection
When you run the next code block, you will be prompted to enter your WRDS username and password and then to complete two-factor authentication on your phone. You will then be asked whether you want to create a `.pgpass` file. In local environments such as PyCharm or Visual Studio, a `.pgpass` file stored on your PC lets you use the WRDS API without re-entering your credentials each time. In Google Colab, however, the file is saved in the temporary home directory (`/root/.pgpass`) and is erased whenever you start a new session, so you can simply respond "`n`".

In [None]:
db = wrds.Connection()

### Controlling DataFrame Display
By default, Pandas may hide some columns or wrap output across multiple lines when displaying a wide DataFrame. The settings below change this behavior so that wide DataFrames display more readably.

In [None]:
pd.set_option('display.max_columns', None)   # Show all columns without truncation
pd.set_option('display.width', 1000)   # Set the display width so output stays on one line
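
The settings above apply for the rest of the session. If you would rather change the display for a single block of output, `pd.option_context` applies options temporarily. The sketch below uses a made-up DataFrame named `wide` purely for illustration.

```python
import pandas as pd

# Small made-up frame with many columns, for illustration only
wide = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

before = pd.get_option("display.max_columns")

# Options set here apply only while the with-block is active
with pd.option_context("display.max_columns", None, "display.width", 1000):
    inside = pd.get_option("display.max_columns")
    print(wide.head(2))

# On exit, the previous setting is restored
after = pd.get_option("display.max_columns")
```

To undo a session-wide change made with `pd.set_option`, you can call `pd.reset_option('display.max_columns')`.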

### Download Data
Function: `get_crsp_msf_by_ids`   

**Inputs**
* WRDS database connection object (`db`)
* List of identifiers (tickers or PERMNOS)
* Start date
* End date  

**Output**
* Pandas DataFrame with the desired *monthly* data. A similar function, `sf.get_crsp_dsf_by_ids`, pulls daily data. The output DataFrames for these two functions include the following variables:
* `date`
* `permno`: CRSP permanent security identifier  
* `permco`: CRSP permanent company identifier
* `ticker`
* `comnam`: Company name
* `shrcd`: [Share code](https://wrds-www.wharton.upenn.edu/data-dictionary/form_metadata/crsp_a_stock_msf_identifyinginformation/shrcd/)  
* `exchcd`: [Exchange code](https://wrds-www.wharton.upenn.edu/data-dictionary/form_metadata/crsp_a_stock_msf_identifyinginformation/exchcd/)
* `siccd`: [Standard Industrial Classification Code](https://www.sec.gov/search-filings/standard-industrial-classification-sic-code-list)
* `prc`: Price. A negative value does not mean the price fell below zero; it means the price was inferred as the midpoint of the closing inside bid and ask quotes rather than taken from an actual end-of-day trade. When using `prc`, always apply `np.abs(prc)` so that you work with the absolute value.
* `ret`: Return with dividends
* `retx`: Return without dividends
* `vol`: Volume
* `shrout`: Shares outstanding  
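
The sign convention for `prc` can be handled as in the following minimal sketch. The rows are made up for illustration and are not actual CRSP data; the column names `from_quotes` and `price` are our own.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not actual CRSP data
sample = pd.DataFrame({
    "permno": [10001, 10001, 10002],
    "prc": [25.50, -26.00, 101.25],   # the negative value marks a bid/ask midpoint
})

sample["from_quotes"] = sample["prc"] < 0   # True when the price is a bid/ask midpoint
sample["price"] = np.abs(sample["prc"])     # price level to use in calculations
print(sample)
```

Keeping a flag such as `from_quotes` before taking the absolute value preserves the information carried by the sign.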

Identifiers can be either **tickers** or **PERMNOs**. The function automatically detects which type of identifier you are using. Tickers can change over time, which becomes an important consideration when pulling data over long periods. To address this, CRSP assigns each security a permanent identifier, the PERMNO, that does not change. In addition, WRDS provides an [online tool](https://wrds-www.wharton.upenn.edu/pages/get-data/center-research-security-prices-crsp/annual-update/tools/translate-to-permcopermno/) to translate tickers into PERMCO/PERMNO identifiers. The CRSP PERMCO is a permanent *company* identifier, while the PERMNO is a permanent *security* identifier. Several securities may be associated with the same company, since firms can issue various classes of equity.
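
To see why a permanent identifier matters, consider the following sketch. The security, its PERMNO (`99999`), and its tickers (`OLD`, `NEW`) are hypothetical and chosen only to illustrate a ticker change.

```python
import pandas as pd

# Hypothetical security whose ticker changes mid-sample; all values are made up
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-31", "2022-02-28", "2022-03-31", "2022-04-29"]),
    "permno": [99999] * 4,
    "ticker": ["OLD", "OLD", "NEW", "NEW"],
})

# Grouping by ticker splits one security's history into two pieces
by_ticker = df.groupby("ticker")["date"].agg(["min", "max"])

# Grouping by PERMNO keeps the full history together
by_permno = df.groupby("permno")["date"].agg(["min", "max"])

print(by_ticker)
print(by_permno)
```

Grouping by `ticker` yields two short histories, while grouping by `permno` yields a single history that spans the ticker change.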

In [None]:
ids = ['IBM', 'F', 'AAPL']   # tickers (or PERMNOs) to download
start = '2015-01-01'
end = '2024-12-31'
data = sf.get_crsp_msf_by_ids(db, ids, start, end)
print(data.head(3))  # print first 3 rows
print()
print(data.tail(3))  # print last 3 rows

### Calculate Turnover
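Turnover is trading volume during the month divided by shares outstanding at the beginning of the month. We create a variable `shrout_lag`, the previous month's shares outstanding for the same ticker, and divide volume by `shrout_lag`. The first month for each ticker has no lagged value, so its turnover is missing.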

In [None]:
data['date'] = pd.to_datetime(data['date']) # ensure that date is a datetime object
data = data.sort_values(['ticker', 'date']) #sort to make sure lagging works properly
data['shrout_lag'] = data.groupby('ticker')['shrout'].shift(1) # create lagged shrout by ticker
data['turnover'] = data['vol'] / data['shrout_lag'] # compute turnover (volume / lagged shares outstanding)
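
To check the lag logic by hand, the following sketch repeats the same steps on a small made-up DataFrame. The tickers `AAA` and `BBB` and all numbers are hypothetical.

```python
import pandas as pd

# Made-up monthly data for two hypothetical tickers
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-28"] * 2),
    "ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "vol": [500.0, 600.0, 660.0, 50.0, 40.0, 80.0],
    "shrout": [1000.0, 1200.0, 1100.0, 200.0, 200.0, 400.0],
})

toy = toy.sort_values(["ticker", "date"])                    # sort so the lag is well defined
toy["shrout_lag"] = toy.groupby("ticker")["shrout"].shift(1)  # prior month's shares, per ticker
toy["turnover"] = toy["vol"] / toy["shrout_lag"]              # NaN in each ticker's first month
print(toy)
```

For `AAA`, February turnover is 600 / 1000 = 0.6 and March turnover is 660 / 1200 = 0.55. Because the shift is done within each ticker, the first `BBB` row is missing rather than borrowing the last `AAA` value.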

### Plot Turnover
We first reshape the data with the `pivot` method so that we have a DataFrame indexed by date with one turnover column per ticker:

* IBM turnover
* F turnover
* AAPL turnover

We then plot turnover for the selected tickers.

In [None]:
# pivot for plotting
pivot_data = data.pivot(index='date', columns='ticker', values='turnover')

# plot
pivot_data.plot(figsize=(10,6), lw=1)

plt.xlabel("Date")
plt.ylabel("Turnover (Vol / Lagged Shares Outstanding)")
plt.legend(title="Ticker")
plt.show()
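
The reshaping step can be seen more easily on a small example. The sketch below pivots a made-up long-format DataFrame (one row per date and ticker) into wide format (one row per date, one column per ticker); the tickers and values are hypothetical.

```python
import pandas as pd

# Made-up long-format data: one row per (date, ticker) pair
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-31", "2024-01-31", "2024-02-29", "2024-02-29"]),
    "ticker": ["AAA", "BBB", "AAA", "BBB"],
    "turnover": [0.10, 0.20, 0.15, 0.25],
})

# Wide format: dates become the index, tickers become the columns
wide_df = long_df.pivot(index="date", columns="ticker", values="turnover")
print(wide_df)
```

Note that `pivot` raises a `ValueError` if any (date, ticker) pair appears more than once; in that case `pivot_table`, which aggregates duplicates, is one alternative.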