# Getting the source data

In this notebook we get the source data that we need to build our model.

The main security holdings data is downloaded from https://dataverse.harvard.edu. We use this source instead of https://www.sec.gov because the Harvard data comes as 2 simple tables, whereas the SEC data is spread across many XML files.

The Harvard data contains only CUSIPs (security identifiers), not security names. We get the names from https://www.sec.gov, where they are published as PDFs (see for example [this PDF](https://www.sec.gov/divisions/investment/13f/13flist2004q2.pdf)). We pre-processed these PDFs in Python and generated a simple TSV<sup>1</sup> file that we will use here. For more details see the folder *../cusips*.

<sup>1</sup>: Tab Separated Values

Let's get the security names and see a few examples:

In [None]:
!mkdir -p source_data
!cp ../cusips/cusips.tsv source_data/cusips.tsv

In [None]:
import pandas as pd
cusips = pd.read_csv('source_data/cusips.tsv', sep='\t')
cusips

In [None]:
cusips.describe()
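
Because the names file was extracted from PDFs, a quick sanity check on the identifiers can be useful. The sketch below implements the standard CUSIP check-digit rule (the ninth character is a modulus-10 "double-add-double" checksum of the first eight). The function names `cusip_check_digit` and `is_valid_cusip` are hypothetical helpers, and the two sample values are well-known public CUSIPs, not rows taken from `cusips.tsv`.

```python
def cusip_check_digit(base8):
    """Return the check digit for the first 8 characters of a CUSIP."""
    total = 0
    for i, ch in enumerate(base8.upper()):
        if ch in "0123456789":
            v = int(ch)
        elif "A" <= ch <= "Z":
            v = ord(ch) - ord("A") + 10      # A=10 ... Z=35
        elif ch in "*@#":
            v = 36 + "*@#".index(ch)         # *=36, @=37, #=38
        else:
            raise ValueError(f"invalid CUSIP character: {ch!r}")
        if i % 2 == 1:                       # every second character is doubled
            v *= 2
        total += v // 10 + v % 10            # add the digits of the value
    return (10 - total % 10) % 10


def is_valid_cusip(cusip):
    """True if cusip has 9 characters and a matching check digit."""
    cusip = cusip.strip()
    if len(cusip) != 9 or cusip[8] not in "0123456789":
        return False
    try:
        return cusip_check_digit(cusip[:8]) == int(cusip[8])
    except ValueError:
        return False


# Well-known public CUSIPs used purely as illustration
print(is_valid_cusip("037833100"), is_valid_cusip("594918104"))
```

Applied to the identifier column of the names table, a check like this would flag rows that may have been garbled during PDF extraction.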

Next we get the holdings data:

In [None]:
!wget 'https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/ZRH3EU/7ATD7M' -O source_data/scrape_parsed.parquet

The holdings data is fairly large, as it contains almost 20 years of holdings reporting for all investors.
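
Since the file is large, re-running the download cell every time the notebook is executed can be wasteful. A minimal sketch of a cached download in plain Python follows; `fetch_once` is a hypothetical helper name, and it assumes that an existing non-empty file at the destination is a complete earlier download.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_once(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return dest                              # reuse the earlier download
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")    # partial file while downloading
    urlretrieve(url, str(tmp))
    tmp.replace(dest)                            # rename only after success
    return dest


# Example call (not executed here), using the same URL as the wget cell:
# fetch_once(
#     "https://dataverse.harvard.edu/api/access/datafile/:persistentId"
#     "?persistentId=doi:10.7910/DVN/ZRH3EU/7ATD7M",
#     "source_data/scrape_parsed.parquet",
# )
```

Writing to a `.part` file first means an interrupted download is not mistaken for a finished one on the next run.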

In [None]:
holdings = pd.read_parquet('source_data/scrape_parsed.parquet')
holdings

More precisely, we have 75 quarters of reporting between 1999 and 2017.

In [None]:
holdings.rdate.nunique(), holdings.rdate.min(), holdings.rdate.max()
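
One way to check that no quarter is missing is to compare the number of distinct report dates with the number of calendar quarters spanned by the first and last date. The sketch below shows that arithmetic; `quarters_between` is a hypothetical helper, and the two dates are placeholder endpoints chosen for illustration, not values read from the data.

```python
from datetime import date


def quarters_between(first, last):
    """Count calendar quarters from first to last, both inclusive."""
    q_first = first.year * 4 + (first.month - 1) // 3
    q_last = last.year * 4 + (last.month - 1) // 3
    return q_last - q_first + 1


# Placeholder endpoints (assumed): replace with the min and max report dates
expected = quarters_between(date(1999, 3, 31), date(2017, 9, 30))
print(expected)
```

If the distinct count of `rdate` is lower than the value computed from its minimum and maximum, at least one quarter is absent from the data.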

We see 8170 investors, identified by their Central Index Key (CIK), and 17811 securities, identified by their CUSIP.

In [None]:
holdings.nunique()

As with securities, we want to identify investors by name, so we download the CIK-to-name mapping.

In [None]:
!wget 'https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/ZRH3EU/RRNFLT' -O source_data/cikmap.tab

In [None]:
cik = pd.read_csv('source_data/cikmap.tab', sep='\t')
cik

In [None]:
cik.nunique()