# Betalyzer


## Installation

I'm running Anaconda with the version dump below. I've tested this code using the latest builds of most libraries. The only library I use that does not ship with Anaconda by default is Quandl, which can be installed with `pip install Quandl`. Note that Quandl is only needed when you Recalculate; otherwise you can run this application without installing it.

Run the app by running `python app.py` and navigating to http://localhost:5000/.

```
[In] import sys
     sys.version
[Out] '3.5.2 |Anaconda custom (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]'
```

## Technologies
I used the following components (other than the obvious like Python, HTML, CSS, jQuery).
 - **[Quandl](http://www.quandl.com):** I chose Quandl as the primary data source because the API is documented and its data coverage is more complete than Yahoo Finance's.
 - **[Pandas](http://pandas.pydata.org/):** Most calculations are done in pandas. Pandas uses numpy in the backend, but provides a layer of abstraction as well as a variety of tools built on top. 
 - **[Flask](http://flask.pocoo.org/):** Flask is a lightweight web framework written in Python. As they say, the Navy uses Django; pirates use Flask. 
 - **[Bokeh](http://bokeh.pydata.org/en/latest/):** Though I've primarily used Highcharts and D3 before, this is my first project using Bokeh for data visualization. I'm extremely impressed by how easy it makes data visualization and by its seamless integration with both web output and pandas DataFrame inputs.
 - **[Bootstrap](http://getbootstrap.com/):** I use Twitter's Bootstrap v3 on the frontend.
 - **[Datatables](https://datatables.net/):** Datatables is a jQuery plugin that renders tables for the web extremely quickly. It has a lot of functionality around sorting, viewing, etc.
 - **[Jupyter Notebook](http://jupyter.org/):** I do most of my prototyping in Jupyter, as well as use it to write these reports.

## Application Structure
Betalyzer is a Flask app split across several files:
 - **`betalyzer.py`:** All of the business logic is in this file.
 - **`app.py`:** This is the Flask app and handles the execution of the web app.
 - **`templates/`:** We have three pages -- `index.html` which has our app's outline, `main.html` which displays the content on the main page (the Datatable and the charts) and `ticker.html` which shows the timeseries of beta for a single ticker.
 - **Pickles:** For performance reasons, I do not want to fetch the data and recalculate the historical betas every time a user opens the web page. So I store the calculations in pickle files, including `df_tickers` for all the tickers, `df_betas` as a timeseries for historical betas by ticker and `df_changes` as a timeseries for daily changes by ticker.
 - **`betalyzer.ipynb`:** This file. This is the resulting report.
 - **`beta-calc-optimizations.ipynb`:** This shows the optimizations performed to improve the calculation speed by a factor of 3000.
 - **`play.ipynb`:** I have left the play notebooks where I did some of my scratch work in place, though none of them are actually used.
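
As a minimal sketch of the pickle caching described above, the load-or-rebuild pattern might look like the following. The helper `load_or_build`, the stand-in builder, and the temporary path are illustrative assumptions, not code taken from `betalyzer.py`.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_or_build(path, build_fn):
    """Return the DataFrame cached at `path`, building and caching it if absent."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = build_fn()
    df.to_pickle(path)
    return df


def build_fake_betas():
    # Hypothetical stand-in for the expensive fetch-and-calculate step.
    dates = pd.date_range("2016-01-04", periods=5, freq="B")
    return pd.DataFrame({"AAPL": np.linspace(0.9, 1.1, 5)}, index=dates)


cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "df_betas.pkl")

first = load_or_build(cache_path, build_fake_betas)   # builds and writes the pickle
second = load_or_build(cache_path, build_fake_betas)  # reads the pickle back
```

With this pattern, a page load only pays the cost of `pd.read_pickle`, and the expensive step runs only when the pickle is missing or a recalculation deletes it.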

## Data Issues
There were a variety of data issues that I came across. 

First, getting a list of current tickers was a bit tricky. Neither Quandl nor Yahoo Finance provided an easy way to accomplish this. However, NASDAQ publishes a daily list of securities traded on its exchange. `nasdaq_url` in `betalyzer.py` sets this source. So I only have NASDAQ stocks in Betalyzer currently.

Second, according to Quandl, their best data source for daily stock price data is WIKI. However, WIKI does not have ETFs, and to calculate the market move, I would like to use an ETF such as SPY. Therefore, I use WIKI for all individual stock price data, and I use GOOG for SPY data.

Third, GOOG is not perfect. Its API does not seem to have a very good calendar, and it contains non-business days (primarily holidays rather than weekends). GOOG also does not adjust for stock splits and dividends. Since we only get SPY from GOOG, the main issue is that SPY is not dividend adjusted. Meanwhile, the WIKI data is adjusted for dividends and splits because we use the 'Adj Close' column.
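
One way to deal with the extra non-business days is to align the market series to the calendar of a ticker whose dates are trusted. The sketch below uses made-up prices and assumes the reference ticker simply lacks the holiday row; it is an illustration, not the exact logic in `betalyzer.py`.

```python
import pandas as pd

# Hypothetical data: the market series carries an extra holiday row
# (2016-01-18, Martin Luther King Jr. Day) that the reference ticker lacks.
reference = pd.Series(
    [100.0, 101.0, 102.0],
    index=pd.to_datetime(["2016-01-15", "2016-01-19", "2016-01-20"]),
    name="AAPL",
)
market = pd.Series(
    [190.0, 190.0, 191.0, 192.0],
    index=pd.to_datetime(["2016-01-15", "2016-01-18", "2016-01-19", "2016-01-20"]),
    name="SPY",
)

# Keep only the dates present in the reference calendar.
market_aligned = market.reindex(reference.index)

# Compute daily percent changes after alignment, so the holiday row
# does not show up as a spurious zero return.
market_changes = market_aligned.pct_change()
```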

Fourth, there is missing data in WIKI from Quandl. For example, the ticker GOOG only starts in mid-2014. Also, a ticker like 'CELG' (randomly chosen) has several missing data points:

```
[In] df_changes['CELG'][df_changes['CELG'].isnull()]
[Out]
Date
2014-10-02   NaN
2014-11-05   NaN
2014-12-03   NaN
2014-12-08   NaN
2014-12-11   NaN
2014-12-29   NaN
2015-01-07   NaN
2015-01-09   NaN
2015-01-16   NaN
2015-01-27   NaN
2015-02-05   NaN
2015-02-06   NaN
2015-07-13   NaN
Name: CELG, dtype: float64
```

There's an option to `.fillna(0)` to work around this issue temporarily. Ideally, however, we'd fill the missing values with the market return so as not to bias the beta.
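
The two fill strategies can be sketched as follows. The numbers are made up, and filling with the market return is the proposed enhancement rather than something the app currently does.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-10-01", periods=4, freq="B")
df_changes = pd.DataFrame(
    {
        "SPY": [0.010, -0.020, 0.005, 0.000],
        "CELG": [0.020, np.nan, 0.010, np.nan],
    },
    index=dates,
)

market = df_changes["SPY"]

# Option 1: replace missing percent changes with zero.
filled_zero = df_changes.fillna(0)

# Option 2: replace each missing change with the market's change on that date.
# DataFrame.apply passes each column as a Series; Series.fillna aligns on the index.
filled_market = df_changes.apply(lambda col: col.fillna(market))
```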

## Settings

The following settings allow the user to choose how betas are pulled and calculated.

 - **`start_date` / `end_date`:** Dates to start and end the pull from Quandl (note that the Quandl API does not seem to respect my start and end dates).
 - **`market`:** Ticker of the market security to use. Defaulted to SPY.
 - **`test_ticker`:** Ticker of a security that we know will have the correct calendar from Quandl (i.e., no missing days). Defaulted to AAPL.
 - **`window`:** Beta window to calculate over. Defaulted to 100.
 - **`ticker_limit`:** Number of tickers to get. Defaulted to 300. **To test the script, I suggest setting this to a small value like 10.**
 - **`ticker_choice`:** ['MARKETCAP' or 'RANDOM'] Either choose tickers by the largest market caps or randomly. Defaulted to MARKETCAP.
 - **`handle_nans`:** ['KEEP' or 'FILLZERO'] Either keep NaNs or replace them with zeros (this applies to `df_changes`, where the values are percent changes, not prices). ENH: Would like to add 'FILLMARKET'.
 - **`nasdaq_url`:** URL to get tickers.
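
To illustrate how `ticker_limit` and `ticker_choice` interact, here is a rough sketch of the selection step. The function name `select_tickers`, the `MarketCap` column name, and the sample values are assumptions made for illustration; the real column names depend on the NASDAQ file.

```python
import pandas as pd


def select_tickers(df_tickers, ticker_limit=300, ticker_choice="MARKETCAP", seed=None):
    """Pick `ticker_limit` rows, by largest market cap or at random."""
    n = min(ticker_limit, len(df_tickers))
    if ticker_choice == "MARKETCAP":
        return df_tickers.nlargest(n, "MarketCap")
    if ticker_choice == "RANDOM":
        return df_tickers.sample(n=n, random_state=seed)
    raise ValueError("ticker_choice must be 'MARKETCAP' or 'RANDOM'")


# Hypothetical ticker table with made-up market caps.
df_tickers = pd.DataFrame(
    {"MarketCap": [600e9, 50e9, 400e9, 5e9]},
    index=["AAA", "BBB", "CCC", "DDD"],
)

largest_two = select_tickers(df_tickers, ticker_limit=2)
random_three = select_tickers(df_tickers, ticker_limit=3, ticker_choice="RANDOM", seed=0)
```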

## Questions

In the problem statement, there were specific questions, which are answered below.

**We strongly recommend using Anaconda ... and including an environment file.**

I have included `environment.yml` but the only package that you need to install beyond the standard Anaconda distribution is Quandl.

**Your basket should include at least 300 names.**

See `ticker_limit` in `betalyzer.py`. The default setting is 300. It is currently set to pick the stocks with the largest market caps, but setting `ticker_choice` to 'RANDOM' instead of 'MARKETCAP' randomizes the selection of stocks.

**Limit manual scrubbing of data.**

This implementation requires no manual scrubbing of data -- everything is automated.

**Select a window and compute running betas**

See `window` in `betalyzer.py`. Default setting is 100.

**Can you achieve better performance than what `numpy.linalg.lstsq` offers? Optimize your algo to compute betas for many names and days at a time.**

Using pandas' rolling functions, we achieve speeds many orders of magnitude faster than `lstsq`. We can compute betas for 1,000 stocks over 5,000+ days (a total of more than 5 million calculations) in around 5 seconds.
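
The idea is that beta is the rolling covariance of a stock with the market divided by the rolling variance of the market, both of which pandas computes for all columns at once. The sketch below uses synthetic returns and made-up ticker names; it shows the technique rather than the exact code in `betalyzer.py`, and cross-checks one value against `numpy.linalg.lstsq`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=250, freq="B")

# Synthetic daily percent changes: one market series plus two stocks.
market = pd.Series(rng.normal(0, 0.01, len(dates)), index=dates, name="SPY")
df_changes = pd.DataFrame(
    {
        "AAA": 1.5 * market + rng.normal(0, 0.005, len(dates)),
        "BBB": 0.5 * market + rng.normal(0, 0.005, len(dates)),
    }
)

window = 100

# beta = cov(stock, market) / var(market), over a trailing window.
rolling_cov = df_changes.rolling(window).cov(market)
rolling_var = market.rolling(window).var()
df_betas = rolling_cov.div(rolling_var, axis=0)

# Cross-check the last window for one ticker against numpy.linalg.lstsq.
x = market.iloc[-window:].to_numpy()
y = df_changes["AAA"].iloc[-window:].to_numpy()
A = np.column_stack([x, np.ones(window)])  # slope and intercept columns
lstsq_beta = np.linalg.lstsq(A, y, rcond=None)[0][0]
```

The first `window - 1` rows of `df_betas` are NaN because a full window is not yet available. The slope from `lstsq` with an intercept term should match the last rolling beta up to floating-point error.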

**Enhance your algorithm to handle unexpected inputs.**

There's an option called `handle_nans` that allows the user to convert `np.nan`s to zeros. This prevents missing data on particular days from causing gaps in the output.

**Design a web app to visualize. You may use a lightweight framework or a more feature rich framework.**

As described above, I chose Flask.

**Visualize the data. The plot should incorporate simple features such as panning and zooming.**

As described above, I chose Bokeh, which has that functionality built in.

**What other types of plots would be useful to visualize this data?**

Bokeh has been a great tool for visualization. I'd add more interactivity to the plots to enable selecting and viewing particular cohorts of stocks and betas. Here are some examples of extensions of Bokeh plotting that I would like to adapt:

 - https://demo.bokehplots.com/apps/movies
 - https://demo.bokehplots.com/apps/selection_histogram
 
Primarily, I would like to visualize the evolution of beta along two dimensions:

 - Historically, compared to the same stock (i.e., is Stock X's beta higher or lower than where it has been?) and
 - Cross-sectionally, compared to similar stocks (by industry or market cap): how does this stock's beta compare?
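
As a rough sketch of the numbers such plots could be built on, both views reduce to percentile ranks over a `df_betas`-shaped frame. The tickers and beta values below are made up, and grouping by industry or market cap is left out.

```python
import pandas as pd

dates = pd.date_range("2016-01-04", periods=5, freq="B")
df_betas = pd.DataFrame(
    {
        "AAA": [0.8, 0.9, 1.0, 1.1, 1.2],
        "BBB": [1.5, 1.4, 1.3, 1.2, 1.1],
        "CCC": [1.0, 1.0, 1.0, 1.0, 1.3],
    },
    index=dates,
)

# Historical view: percentile of each ticker's latest beta within its own history.
historical_pct = df_betas.rank(pct=True).iloc[-1]

# Cross-sectional view: percentile of each ticker's latest beta among all tickers.
cross_sectional_pct = df_betas.iloc[-1].rank(pct=True)
```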

## Extensions

Should this have been a production level project, here are some extensions that could have been added:

 - **TESTS:** I would add unit tests and some regression tests for various components of the app, including but not limited to the `build_betas()` function. I would also save the previous beta output to check that new beta output is in line with it. This should also catch data errors (for example, if Quandl somehow sends us different historical data than it did the day before). Additionally, we'd need tests around the data (e.g., the columns in the NASDAQ csv) to ensure that we're getting the data we expect.
 - **Additional data:** We currently only get NASDAQ data. We should get data from NYSE as well to cover most equities. We also do not get ETFs. Additionally, there are some shortcomings in the adjustments to the SPY data (it is not total return) that need to be taken into account. 
 - **Datastore:** The data is currently stored in pickled DataFrames, and we only keep the data we need (the rest is discarded). Ideally, we'd move this to a Postgres database, and perhaps Redshift if the data grows beyond a few GBs, for optimal performance.
 - **Select recalculation:** Currently, on recalculation, all data is fetched from Quandl, even data that we've used before, and all calculations are made again. Ideally, we should only fetch new data and calculate based on the new data. 
 - **Auto-recalculation:** Currently, only manual recalculation is supported. I would add a process that auto-recalculates every business day at a certain time, once both NASDAQ and Quandl have updated their data. Perhaps an asynchronous process and a scheduler (Celery and Celery Beat are what I've used in the past) could help here.
 - **Settings through UI:** Currently, the modal that pops up in the web app does not contain options to edit the settings in `betalyzer.py` (see settings above). Ideally, I'd add these to the 'Recalculation' modal, so the user can select start date, end date, market index, beta calculation window, etc.
 - **[numba](http://numba.pydata.org/):** Luckily, the components we needed to calculate beta (`cov`, `var`) were built into pandas' rolling library. However, we may have faced a situation where these functions were not built in. In that case, we could have used `numba` to optimize our calculations. (Note I'm no expert on numba -- it's one area I'd like to improve upon.)
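
As an illustration of the regression check mentioned under TESTS, one could compare freshly calculated betas against a saved baseline over the dates and tickers they share. The helper `compare_to_baseline` and the sample frames below are hypothetical, a sketch of the idea rather than part of the app.

```python
import numpy as np
import pandas as pd


def compare_to_baseline(df_new, df_baseline, tol=1e-8):
    """Return (date, ticker) pairs where overlapping betas differ by more than tol."""
    common_idx = df_new.index.intersection(df_baseline.index)
    common_cols = df_new.columns.intersection(df_baseline.columns)
    new = df_new.loc[common_idx, common_cols].to_numpy(dtype=float)
    old = df_baseline.loc[common_idx, common_cols].to_numpy(dtype=float)
    # NaN in both frames counts as a match; NaN in only one is a mismatch.
    close = np.isclose(new, old, atol=tol, rtol=0, equal_nan=True)
    rows, cols = np.where(~close)
    return [(common_idx[r], common_cols[c]) for r, c in zip(rows, cols)]


dates = pd.date_range("2016-01-04", periods=4, freq="B")

# Yesterday's saved output (made-up values).
baseline = pd.DataFrame(
    {"AAA": [1.0, 1.1, 1.2], "BBB": [np.nan, 0.9, 0.8]},
    index=dates[:3],
)

# Today's output: one new day, plus one historical value that silently changed.
new = pd.DataFrame(
    {"AAA": [1.0, 1.1, 1.2, 1.3], "BBB": [np.nan, 0.9, 0.85, 0.7]},
    index=dates,
)

mismatches = compare_to_baseline(new, baseline)
```

An empty list means the history is unchanged. Any entries point at dates where the upstream data or the calculation changed.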