## DSI-9 Capstone Project: A recommendation system for UK equities
### Michael Wharton

### Contents

- [Executive summary](#execsum)
- [Context](#context)
- [Methodology and model overview](#method)
	- [Collaborative filtering](#colab)
	- [Content-based filtering](#content)
	- [Recommendations using classification model probabilities](#classify)
	- [Model aggregation](#aggregation)
- [Variables](#var)
	- [Collaborative filtering](#colabvar)
	- [Content filtering](#contentvar)
	- [Classification models](#classifyvar)
- [Model evaluation](#evaluation)
	- [Collaborative filtering](#colabeval)
	- [Content filtering](#contenteval)
	- [Classification models](#classifyeval)
- [Appendix](#appendix)

## Executive summary<a name="execsum"></a>

- Collaborative filtering, content filtering and classification methods are used to recommend shares in certain London-listed public companies to investors

- Recommendations can be made for investors who are not in the training data, given the investor's portfolio, using the content filtering model only

- For investors within the training data, recommendations are also made using collaborative filtering

- For the largest investors in the training data, classification methods are additionally used to make recommendations

- Recommendations from the models can be used separately or aggregated

- Data used comprise:
	- A core universe of $211$ companies and their features, for content filtering;
	- Holdings of $1561$ investors in a larger universe of $449$ companies, for collaborative filtering; and 
	- The top $30$ investors and their holdings in the core universe of $211$ companies, used for the classification models

## Context <a name="context"></a>

Institutional investors are firms which invest professionally on behalf of a group of people. Pension funds are a good example: as someone pays into their pension over their working life, they generally seek a return greater than that available by, for example, depositing cash in a bank. This entails choices about the myriad asset classes (e.g. public equity, fixed income, real estate, private equity), and about the many potential investments available within each asset class. To take UK equities as an example, there are over $2000$ companies listed on the London Stock Exchange at the time of writing, and this number has been higher in the past. The potential for significant combinatorial complexity, portfolio effects[^fn_m] and the often nontrivial nature of evaluating individual potential investments mean that much investment is outsourced to professionals.

[^fn_m]: e.g. covariance between assets, _pace_ Markowitz.

Many large pension funds and insurers invest significantly in UK equities, and a fund manager, either at the institution itself or at a third-party asset manager, will have the job of selecting shares to invest in. The investment selection process can be seen as a funnel: ideas for specific shares are sought from a variety of sources, each with a hypothesis as to why it may be a good investment. These ideas can then be evaluated and the number of potential investments reduced.

Such buy-side firms employ research analysts whose job is to unearth the best prospects within a given subset of potential investments, e.g. European technology companies or US oil and gas companies. Additionally, fund managers meet the management of many companies and will have their own ideas. Certain sell-side firms also do research, and salespeople at sell-side firms will provide ideas to fund managers with the goal of generating trading commission.

The goal of this project is to provide a proxy for this idea provision by sell-side salespeople through a recommendation engine. The project uses data for $211$ larger-capitalisation UK public companies for which good data are available and excludes banks and insurers, for the sake of simplicity when generating features from accounting data.[^fn0]

[^fn0]: The accounts of financial firms are significantly different to those of e.g. manufacturers and retailers.

## Methodology and model overview <a name="method"></a>

An almost arbitrary number of recommendations, $k$, can be provided.[^fn1] Recommendations are ranked, such that the best recommendation is included in any list of recommendations. $10$ recommendations is the default.

[^fn1]: Clearly these become less useful as $k \to 211$.

In order to provide overall recommendations, three separate approaches are used. These are:

1. Collaborative filtering
2. Content filtering
3. Binary classification model(s), particularly in relation to the probability they ascribe to the positive class

### 1. Collaborative filtering <a name="colab"></a>

Collaborative filtering is a common methodology in recommendation engines, using ratings of items given by users where available to estimate those that are not.

In the context of users, items and ratings, a missing rating is estimated by taking the overall average rating and adjusting this to reflect both the rating bias of the specific user and the specific item.

In the context of investors, shares and percentage shareholdings, a shareholding of $0$ -- i.e. a company in which the investor is not invested -- is deemed to be missing and is 'estimated' by taking the overall average shareholding  and adjusting this to reflect the holding bias of the specific investor and the holding bias of the specific company. The former of these should have more impact because the percentage shareholdings would be expected to sum to $100$. In practice, however, the sum of the percentage holdings will be $<100 \%$ because of the inherently approximate nature of third-party share register data.

By then ranking these 'estimated' holdings for a given investor, we can determine which $k$ potential investments to recommend.
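The baseline adjustment described above can be sketched in plain Python. This is a minimal illustration only: the function names `baseline_estimates` and `top_k`, the toy investors and companies, and the use of simple mean deviations as biases are all assumptions made here, and the algorithm actually selected later in this document (SVD) learns its biases and factors differently.

```python
from collections import defaultdict

def baseline_estimates(holdings):
    """Estimate missing holdings as overall mean + investor bias + company bias.

    `holdings` maps (investor, company) -> percentage holding for known,
    non-zero positions only; zero holdings are treated as missing.
    """
    mu = sum(holdings.values()) / len(holdings)

    by_investor = defaultdict(list)
    by_company = defaultdict(list)
    for (inv, co), pct in holdings.items():
        by_investor[inv].append(pct)
        by_company[co].append(pct)

    # Bias = distance of the investor's (or company's) mean holding from the overall mean
    b_inv = {i: sum(v) / len(v) - mu for i, v in by_investor.items()}
    b_co = {c: sum(v) / len(v) - mu for c, v in by_company.items()}

    # Only estimate the pairs that are missing from the known holdings
    return {
        (i, c): mu + b_inv[i] + b_co[c]
        for i in b_inv
        for c in b_co
        if (i, c) not in holdings
    }

def top_k(estimates, investor, k=10):
    """Rank one investor's estimated holdings and keep the k largest."""
    own = [(co, est) for (inv, co), est in estimates.items() if inv == investor]
    return sorted(own, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy data: two investors, three companies
holdings = {
    ("Fund A", "X"): 2.0,
    ("Fund A", "Y"): 4.0,
    ("Fund B", "X"): 1.0,
    ("Fund B", "Z"): 1.0,
}
estimates = baseline_estimates(holdings)
print(top_k(estimates, "Fund B", k=1))
```

In the toy data the overall mean is $2$, Fund B holds less than average and company Y is held more heavily than average, so the estimate for Fund B in Y combines both adjustments.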

### 2. Content-based filtering <a name="content"></a>

Content-based filtering is another common approach to recommendation engines, relying on similarity between items.

In the context of users, items and ratings, items are compared in order to establish their similarity, and then recommended based on their similarity to items which have been highly rated by a given user already.

In the context of investors, shares and percentage shareholdings, the features of the companies are compared. More specifically, for the portfolio of companies in which the investor is known to have invested, the most similar companies are found for each company. For each comparison of a company in the portfolio and the $210$ other companies for which we have data, cosine similarity is calculated. This can be thought of as a similarity score. 

These are then aggregated, so that the company which is similar to many companies in the portfolio is accorded a higher score than the company which is similar to only one. 

After this aggregation step, the top $k$ best scoring companies overall are returned.
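The scoring and aggregation steps above can be sketched as follows. This is a hedged illustration in plain Python rather than the project's implementation: the function names, the two-feature vectors and the tickers `PUB1`, `PUB2`, `PUB3` and `TELCO` are hypothetical, and the sketch sums similarities over every candidate whereas the notebook code at the end of this document only considers each holding's nearest neighbours.

```python
import math
from collections import defaultdict

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def content_recommendations(portfolio, features, k=10):
    """Sum each candidate's similarity to every held company; return the top k."""
    scores = defaultdict(float)
    for held in portfolio:
        for candidate, vec in features.items():
            if candidate in portfolio:
                continue  # never recommend something already held
            scores[candidate] += cosine_similarity(features[held], vec)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical feature vectors (assumption: two features per company)
features = {
    "PUB1": [1.0, 0.0],
    "PUB2": [0.9, 0.1],
    "PUB3": [1.0, 0.1],
    "TELCO": [0.0, 1.0],
}
recs = content_recommendations(["PUB1", "PUB2"], features, k=2)
print(recs)
```

Because the scores are summed, a candidate that resembles both holdings (`PUB3`) outranks one that resembles neither (`TELCO`).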

### 3. Recommendations using binary classification model probabilities  <a name="classify"></a>

Classification models use supervised learning techniques to predict a classification associated with a given set of features.

In the context of investors, shares and percentage shareholdings, we know which investors have chosen to invest in which companies, either at all (i.e. they hold $0$ shares or $>0$ shares) or above a certain _de minimis_ threshold.

In this case a _de minimis_ threshold of $0.25 \%$ is used to filter the feature matrix before use. Holdings $<0.25 \%$ therefore become the negative class and holdings $>0.25 \%$ the positive class, so a holding of $0.25 \%$ becomes the implied decision boundary for the estimates. The purpose is to acknowledge the inexact nature of the register data and to train the models on a stronger signal by raising the threshold.

The choice of $0.25 \%$ is essentially arbitrary; however, it removes many of the investors in the holdings matrix, retaining around one eighth of them by reducing the number from $19914$ to $2568$. Missing values are then replaced with $0$ and the holdings are binarised such that $y \to 1 \text{ if } y > 0$.
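The filtering and binarisation can be sketched as below. This is a minimal plain-Python illustration under stated assumptions: the nested-dictionary layout, the function name `binarise_holdings` and the example funds and tickers are hypothetical, not the project's actual data structures.

```python
DE_MINIMIS = 0.25  # percentage holding threshold used in the text

def binarise_holdings(holdings, companies, threshold=DE_MINIMIS):
    """Return {investor: {company: 0 or 1}} after applying the threshold.

    `holdings` maps investor -> {company: percentage holding}. Holdings at or
    below the threshold are treated as missing, missing values become 0 and
    everything that survives becomes 1.
    """
    labels = {}
    for investor, positions in holdings.items():
        kept = {co for co, pct in positions.items() if pct > threshold}
        if not kept:
            continue  # no holding above the threshold: the investor drops out
        labels[investor] = {co: int(co in kept) for co in companies}
    return labels

# Hypothetical toy data
companies = ["X.L", "Y.L", "Z.L"]
holdings = {
    "Fund A": {"X.L": 1.2, "Y.L": 0.1},
    "Fund B": {"X.L": 0.05},
}
labels = binarise_holdings(holdings, companies)
print(labels)
```

Fund B disappears entirely because none of its holdings clears the threshold, which is the mechanism by which the threshold reduces the number of investors.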

Shown below is the effect on the distribution.

![alt text](class_data_treat.png)

Three classification models are each trained on $80 \%$ of data and then the classification predictions (invested or not invested) are tested against the remaining $20 \%$ of data, the test set. The model that achieves the highest precision score on the test set is then used to make recommendations.

Because classification models work best when there is not a significant disparity between the classes -- in this case invested or not invested -- it is only practicable to run the models for some of the larger investors where class imbalance is within certain bounds.

For this reason, these models have been undertaken for the top $30$ investors only. Even so, there is still significant class imbalance at the top (L&G) and the bottom (Capital Research) of this subset. It was decided to stop at the top $30$ investors because class imbalance becomes more of an issue going down the ranked list as the negative class -- not invested -- becomes more and more predominant.

In order to deal with class imbalance, random oversampling of the least represented class is undertaken in order to train the models on balanced classes.

To then generate the predictions from the classification models, shares the investor already holds are discarded and the remaining shares are ordered by the probability of being invested. The top $k$ serve as recommendations, being those shares that the model thinks the investor is most likely to invest in, within the subset of shares that the investor is not actually invested in.
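The oversampling and ranking steps can be sketched as follows. This is a hedged plain-Python illustration: the function names `random_oversample` and `rank_unheld`, the fixed seed and the toy probabilities are assumptions made here, and the project's own models and resampling routine are not shown in this document.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until the classes balance."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present to oversample")
    # Draw with replacement from the minority class to make up the shortfall
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

def rank_unheld(probabilities, held, k=10):
    """Discard shares already held, then return the k most probable remaining shares."""
    candidates = {co: p for co, p in probabilities.items() if co not in held}
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical toy data: one invested row and four not-invested rows
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X, y)
print(sum(y_bal), len(y_bal) - sum(y_bal))

# Hypothetical positive-class probabilities from a fitted classifier
probs = {"A.L": 0.9, "B.L": 0.8, "C.L": 0.4}
print(rank_unheld(probs, held={"A.L"}, k=2))
```

Oversampling would normally be applied to the training split only, so that duplicated rows cannot leak into the test set.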

### Model aggregation  <a name="aggregation"></a>

As many of the three models are used as possible. By way of example:

- For an investor in the top $30$, collaborative filtering, content filtering and classification models are used, because such an investor will be in the training data for all three.

- For an unseen investor who is not in the set of $1561$ used to train the collaborative filtering model, content-based filtering can still be used as long as their portfolio is known. In this case, the other two models cannot be used because the investor is absent from their training data.

Given the $k$ recommendations from each of the applicable models, the respective scores and/or probabilities are then scaled to $[0,1]$ so that the strength of a recommendation from each approach is comparable. These are then combined into one list and sorted, with the top $k$ being taken again. To avoid ties in the sorting, a small random value in $[0, 0.01)$ is added to each score.
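Written out, the scaling step is ordinary min-max scaling within each model's list of $k$ scores. The symbols $s_i$, $\tilde{s}_i$ and $\varepsilon_i$ are notation introduced here for illustration, and the jitter range is taken from the `jitter_bug` helper in the notebook code at the end of this document.

$$
\tilde{s}_i = \frac{s_i - \min_j s_j}{\max_j s_j - \min_j s_j} + \varepsilon_i, \qquad \varepsilon_i \sim U[0,\ 0.01)
$$

The strongest item in each list therefore scores just over $1$ and the weakest just over $0$, regardless of which model produced the list.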

## Variables  <a name="var"></a>

### Collaborative filtering  <a name="colabvar"></a> 

A $1561 \times 449$ matrix of shareholdings is used as the training data for collaborative filtering, covering $1561$ investors and shares in $449$ companies, representing most of the FTSE All-Share ex-investment companies. These $1561$ investors are drawn from a larger dataset of $19914$, and represent those investors with $>5$ holdings within the matrix.

### Content filtering  <a name="contentvar"></a>

A $211 \times 327$ matrix of company features is used as the training data for content filtering, representing $211$ companies and $327$ features associated with those companies. The $211$ companies represent the subset of the FTSE All-Share ex-investment companies index of UK-listed public companies for which the full set of feature data is available at the time of writing, excluding banks and insurers.

These features include:

- Features derived from financial estimates -- using revenue, EBITDA, EBIT, PBT, EPS and DPS for FY1-3, where FY1 is the next forward financial year end e.g. Dec-19E, FY2 the next after that e.g. Dec-20E etc. See appendix for transformations on these data
- Type of business: Economic, Business, Industry Group, Industry and Activity classifications[^fn-trbc]
- Certain market data:
	- Price/mean price target and mean buy/hold/sell recommendation, both proxies for sell-side endorsement; 
	- Number of EPS estimates, a proxy for sell-side coverage;
	- EPS estimate diffusion, a proxy for estimate revisions; and
	- NTM P/E, a yardstick for valuation

[^fn-trbc]: [TRBC](https://en.wikipedia.org/wiki/Thomson_Reuters_Business_Classification) is used for this.

### Classification models  <a name="classifyvar"></a>

A $30 \times 211$ matrix is used as the training data for the classification models, being the top $30$ investors when ranked by number of investments in the $211$ companies included in the feature matrix for the content filtering.

The top $30$ investors by number of investments are shown below. This is after filtering to include only holdings $> 0.25 \%$.

![alt text](pic3.png)

## Model evaluation  <a name="evaluation"></a>

### Collaborative filtering  <a name="colabeval"></a>

Cross validation has been used to select an algorithm for collaborative filtering. Algorithms are scored using root mean squared error (RMSE), with a lower RMSE being better.

|                 |    RMSE |
|:----------------|:-----------:|
| SVD             |     1.0393  |
| KNNBaseline     |     1.04404 |
| SVDpp           |     1.0452  |
| SlopeOne        |     1.04858 |
| KNNWithMeans    |     1.0524  |
| BaselineOnly    |     1.05362 |
| KNNWithZScore   |     1.07856 |
| CoClustering    |     1.11275 |
| NMF             |     1.11505 |
| KNNBasic        |     1.19528 |
| NormalPredictor |     1.53661 |

After fitting the model using SVD on a training set representing $80 \%$ of the data, the worst and the best predictions against a test set of $20 \%$ of the data were obtained and examined. SVD achieved an RMSE of $1.01$ on the test data, comparable to that from cross validation, and MAE of $0.364$.
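The gap between RMSE and MAE is informative here, and a short sketch makes the reason concrete. The functions below are standard definitions written in plain Python; the toy holdings are hypothetical and are not drawn from the project's data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors disproportionately."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical toy case: three exact predictions and one large miss
actual = [0.0, 0.0, 0.0, 4.0]
predicted = [0.0, 0.0, 0.0, 0.0]
print(rmse(actual, predicted), mae(actual, predicted))
```

A single large miss among otherwise exact predictions gives an RMSE of twice the MAE in this toy case. An RMSE well above the MAE, as reported above, is therefore consistent with most errors being small and a few being large.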

Best and worst predictions from the model are shown below. The best predictions are not particularly informative as these were all small holdings which were correctly predicted. In the tables, `rui` is the actual holding, `est` is the estimated holding and `err` is the absolute difference between the two. $I_u$ represents the number of holdings in the dataset belonging to a given investor and $U_i$ represents the number of investors that are thought to hold shares in the company.

| inv                                       | co   |    rui |    est |   err |   Iu |   Ui |
|:-------------------------------------------------|:-------------|-------:|-------:|------:|-----:|:----:|
| VI Vorsorgeinvest AG                             | AZN.L        | 0.0001 | 0.0001 |     0 |    6 |  453 |
| DBX Advisors LLC.                                | MGGT.L       | 0.0001 | 0.0001 |     0 |   41 |  258 |
| China Tonghai Asset Management Limited           | BT.L         | 0.0001 | 0.0001 |     0 |   10 |  378 |
| Thurgauer Kantonalbank                           | CNA.L        | 0.0001 | 0.0001 |     0 |   66 |  279 |
| Desjardins Global Asset Management               | GLEN.L       | 0.0001 | 0.0001 |     0 |   55 |  383 |
| Affinity Investment Advisors, LLC                | TSCO.L       | 0.0001 | 0.0001 |     0 |   10 |  378 |
| Schroder Investment Management (Hong Kong) Ltd.  | AAL.L        | 0.0001 | 0.0001 |     0 |   10 |  446 |
| State Street Global Advisors Australia Ltd.      | ICAG.L       | 0.0001 | 0.0001 |     0 |   67 |  273 |
| BlackRock Investment Management (Australia) Ltd. | BT.L         | 0.0001 | 0.0001 |     0 |   35 |  378 |
| Fubon Asset Management Company Ltd.              | HSBA.L       | 0.0001 | 0.0001 |     0 |   31 |  422 |

For the best predictions, the mean $I_u$ is 33 and the mean $U_i$ is 365.

The worst predictions are all significant actual holdings with comparatively low estimates.

| inv                                | co   |     rui |      est |     err |   Iu |   Ui |
|:------------------------------------------|:-------------|--------:|---------:|--------:|-----:|:----:|
| BlackRock Investment Management (UK) Ltd. | CEY.L        | 15.0907 | 1.37646  | 13.7142 |  351 |   92 |
| Merian Global Investors (UK) Limited      | OSBO.L       | 15.1112 | 1.35902  | 13.7522 |  158 |  102 |
| Schroder Investment Management Ltd. (SIM) | STUS.L       | 18.974  | 5.11165  | 13.8624 |  257 |   29 |
| RWC Partners Limited                      | ITE.L        | 15.0252 | 1.10001  | 13.9252 |   60 |   56 |
| Allan Gray Proprietary Limited            | RDI.L        | 16.3576 | 1.77759  | 14.58   |    7 |   84 |
| MFS Investment Management                 | RB.L         | 15.8769 | 0.954271 | 14.9226 |  137 |  519 |
| Toscafund Asset Management LLP            | TALK.L       | 18.754  | 2.32253  | 16.4315 |   10 |   49 |
| Jupiter Asset Management Ltd.             | ARWA.L       | 18.3036 | 1.33923  | 16.9644 |  144 |   66 |
| Phoenix Asset Management Partners Ltd.    | DTY.L        | 26.7083 | 1.15564  | 25.5527 |   12 |   76 |
| INVESCO Asset Management Limited          | NRRT.L       | 28.5469 | 2.52872  | 26.0182 |  194 |   83 |

For the worst predictions, the mean $I_u$ is 133 and the mean $U_i$ is 116. The model therefore seems to perform better for smaller investors (i.e. investors who hold fewer companies) investing in more widely held companies (i.e. companies with more investors on the share register).

The histogram of the errors and absolute errors on the predictions made on the test set is below. As would be expected given RMSE $\sim 1 \%$, most predictions have small error with only a small number of predictions having significant error.

![alt text](pred_errors.png)

Comparing actual and predicted holdings, it can be seen that collaborative filtering systematically over-estimates the smaller holdings and under-estimates the larger ones.

![alt text](pred_vs_act.png)

Below are shown precision and recall at $k$ using a $0.62 \%$ threshold. $0.62 \%$ has been used as this is the baseline (overall mean) for the holdings matrix.

![alt text](pic2.png)
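For reference, one common way of computing precision and recall at $k$ for a single investor is sketched below. The exact definition used to produce the chart is not stated in this document, so the definition here is an assumption: a holding counts as relevant if the actual holding is at or above the threshold, and as recommended if it is in the top $k$ by estimate and the estimate is at or above the threshold. The function name and toy numbers are hypothetical.

```python
def precision_recall_at_k(predictions, k=10, threshold=0.62):
    """Precision and recall at k for one investor.

    `predictions` is a list of (estimated, actual) percentage holdings.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    n_relevant = sum(actual >= threshold for _, actual in ranked)
    top = ranked[:k]
    n_recommended = sum(est >= threshold for est, _ in top)
    n_hits = sum(est >= threshold and actual >= threshold for est, actual in top)
    # Conventions for the undefined cases are an assumption: report 0.0
    precision = n_hits / n_recommended if n_recommended else 0.0
    recall = n_hits / n_relevant if n_relevant else 0.0
    return precision, recall

# Hypothetical (estimated, actual) pairs for one investor
preds = [(2.0, 3.0), (1.5, 0.1), (0.9, 1.0), (0.3, 2.0)]
precision, recall = precision_recall_at_k(preds, k=3)
print(precision, recall)
```

In the toy case three holdings are recommended, two of which are relevant, and one relevant holding is missed because its estimate ranks fourth.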

### Content filtering  <a name="contenteval"></a>

The content filtering model compares feature vectors using cosine similarity. As such it is deterministic given a certain set of features, and evaluation relates to feature selection. The acid test for the content filtering model is whether the user with material domain knowledge regards the recommendations as relevant.

By way of example:

Entering `Fuller, Smith and Turner` (ticker: FSTA.L), a pubco, returns the below cosine similarities:

```
MARS.L Marston's 89
WTB.L Whitbread 82
GRG.L Greggs 82
CPG.L Compass Group 77
RTN.L Restaurant Group 75
DOM.L Domino's Pizza Group 54
SSPG.L SSP Group 49
CINE.L Cineworld Group 34
GYM.L GYM Group 27
```

Of these results, Marston's is another pubco, Whitbread is a pubco and hotel operator, Greggs is a fast food retailer, Compass Group is a contract food provider, Restaurant Group is a restaurant operator, Domino's is a fast food franchiser, SSP is an airport caterer, Cineworld is a cinema operator and Gym Group is a gym operator.

These are all consumer-facing food or leisure businesses which are basically a play on UK consumer confidence and macro outlook.

Entering `Vodafone` (ticker: VOD.L), a large mobile phone company, returns:

```
TALK.L Talktalk Telecom Group 178
BT.L BT Group 175
RR.L Rolls-Royce Holdings 29
CAPCC.L Capital & Counties Properties 23
SPI.L Spire Healthcare Group 21
WMH.L William Hill 20
DC.L Dixons Carphone 19
CNA.L Centrica 17
FRES.L Fresnillo 16
```

TalkTalk is a telecoms provider, as is BT, and both businesses have suitably high cosine similarities, with BT's roughly $6 \times$ that of Rolls-Royce. Dixons Carphone is potentially also relevant as a retailer of mobile phones, as well as other consumer electronics.

Although not precisely quantifiable, precision at $k$, being the proportion of recommended items in the top-$k$ set that are relevant, seems good for these examples.

### Classification models  <a name="classifyeval"></a>

Logistic regression, SVM and AdaBoost with random forest models were tested for each of the $30$ investors, with the two classes being hold ($>0.25 \%$ holding) and do not hold ($<0.25 \%$ holding).

The evaluation metrics were computed for each investor. Below are shown results for `BlackRock Advisors (UK) Limited`:

|               |   AdaBoost |    LogReg |      SVM | Best     |
|:--------------|-----------:|----------:|---------:|:--------:|
| Accuracy      |  0.744681  | 0.829787  | 0.744681 | LogReg   |
| Baseline      |  0.510638  | 0.510638  | 0.510638 | -        |
| Improvement   |  0.234043  | 0.319149  | 0.234043 | LogReg   |
| PR AUC        |  0.822325  | 0.873067  | 0.839429 | LogReg   |
| Precision     |  0.866667  | 0.857143  | 0.923077 | SVM      |
| ROC AUC       |  0.740942  | 0.828804  | 0.740036 | LogReg   |
| Recall        |  0.565217  | 0.782609  | 0.521739 | LogReg   |
| Train_CV_mean |  0.801938  | 0.759767  | 0.759599 | AdaBoost |
| Train_CV_sd   |  0.0499497 | 0.0515466 | 0.053361 | SVM      |

Predictions are produced for each of the three models over the whole dataset, both training and test. The predictions made by the model with the best precision on the test set -- in this case SVM -- are used, with the invested class probabilities ranked once the shares already held by that investor have been removed.

Precision scores for the various models over the $30$ top investors are shown below.

![alt text](pic5.png)

Overall, AdaBoost produces the highest mean precision score and the lowest standard deviation of precision scores.

|    |   AdaBoost |   LogReg |      SVM |
|:---|-----------:|---------:|:--------:|
| µ  |   0.856794 | 0.819546 | 0.844079 |
| σ  |   0.122042 | 0.14388  | 0.129495 |

Improvement in accuracy in comparison to the baseline is shown below.

![alt text](accuracy.png)

## Appendix  <a name="appendix"></a>

**Financial feature transformations** each undertaken for Revenue, EBITDA, EBIT, PBT, EPS and DPS are:

- ACAG
- GYP
- LRTP
- MEAN
- PPP
- SD
- TRTL

Each of these is calculated over the period FY1 to FY3. For a company which has reported 2018 annual financial results for the year ending 31 December 2018, FY1 is the year to 31 December 2019, FY2 the year to 31 December 2020, etc.

ACAG is adjusted CAGR -- essentially a CAGR metric with tweaks to account for a change in sign between start and end periods.

LRTP is latest period relative to peak period i.e. $FY_3 / \text{max}(FY_1,FY_2,FY_3)$

TRTL is the nadir period relative to the latest period i.e. $\text{min}(FY_1,FY_2,FY_3) / FY_3$

GYP is the number of periods showing growth divided by the number of periods less one, e.g. if the metric grows in both FY2 and FY3, GYP is $2/2 = 1$.

PPP is positive periods proportion i.e. the proportion of the periods for which the metric is $>0$.

SD and MEAN are used to calculate $\log_{10} (\bar x/\sigma_{x})$ and then discarded, with $\log_{10} (\bar x/\sigma_{x})$ being retained as a feature.
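Most of these transformations can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the project's code: the function name `forecast_features` and the key `LOG_MEAN_SD` are hypothetical, the sample standard deviation is assumed (the text does not say whether sample or population is used), and ACAG is omitted because its sign adjustments are not specified here. The sketch also assumes a non-zero FY3 value and a positive mean-to-SD ratio.

```python
import math
import statistics

def forecast_features(fy):
    """Compute features from [FY1, FY2, FY3] estimates for one metric."""
    latest = fy[-1]
    # Count the periods in which the metric is higher than in the prior period
    growth_periods = sum(later > earlier for earlier, later in zip(fy, fy[1:]))
    mean = statistics.mean(fy)
    sd = statistics.stdev(fy)  # assumption: sample standard deviation
    return {
        "LRTP": latest / max(fy),                # latest relative to peak
        "TRTL": min(fy) / latest,                # trough relative to latest
        "GYP": growth_periods / (len(fy) - 1),   # growth periods proportion
        "PPP": sum(x > 0 for x in fy) / len(fy), # positive periods proportion
        "LOG_MEAN_SD": math.log10(mean / sd),    # retained in place of MEAN and SD
    }

# Hypothetical revenue estimates growing 10% a year
feats = forecast_features([100.0, 110.0, 121.0])
print(feats)
```

For a steadily growing metric, LRTP, GYP and PPP all equal $1$ and TRTL is the ratio of FY1 to FY3.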

In [1]:
from mycode import *

from warnings import simplefilter
simplefilter(action='ignore', category=DeprecationWarning)

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
pd.set_option('display.precision', 3)

In [2]:
df_sec = pd.read_pickle('./securites.pickle')
df_colab_recs = pd.read_pickle('./colab_recs.pickle')
class_probs = joblib.load('./prob_recs.sav')
cont_filt_sims = joblib.load('./co_similarities.sav')
df_cf_holdings = pd.read_pickle('./sparse_reg.pickle')

In [3]:
print('Institutional investors for which classification probabilities can be used:\n---\n')
print(sorted(list(class_probs.keys())))

Institutional investors for which classification probabilities can be used:
---

['AXA Investment Managers UK Ltd.', 'Aberdeen Asset Investments Limited', 'Aberdeen Standard Investments (Edinburgh)', 'Artemis Investment Management LLP', 'Aviva Investors Global Services Limited', 'BlackRock Advisors (UK) Limited', 'BlackRock Institutional Trust Company, N.A.', 'BlackRock Investment Management (UK) Ltd.', 'Capital Research Global Investors', 'Columbia Threadneedle Investments (UK)', 'Dimensional Fund Advisors, L.P.', 'Fidelity International', 'Fidelity Management & Research Company', 'HSBC Global Asset Management (UK) Limited', 'INVESCO Asset Management Limited', 'Investec Asset Management Ltd.', 'JPMorgan Asset Management U.K. Limited', 'Jupiter Asset Management Ltd.', 'Legal & General Investment Management Ltd.', 'M & G Investment Management Ltd.', 'Norges Bank Investment Management (NBIM)', 'Nuveen LLC', 'Royal London Asset Management Ltd.', 'Schroder Investment Management Ltd. (SIM)'

In [4]:
print('Cos for which content filtering can be used:\n---\n')
print(sorted(list(cont_filt_sims.keys())))

Cos for which content filtering can be used:
---

['AAAA.L', 'ABF.L', 'AGGK.L', 'ANTO.L', 'APTD.L', 'ARWA.L', 'ASCL.L', 'AVST.L', 'AVV.L', 'AZN.L', 'BAES.L', 'BAG.L', 'BALF.L', 'BATS.L', 'BBA.L', 'BBOXT.L', 'BDEV.L', 'BIFF.L', 'BKGH.L', 'BOY.L', 'BP.L', 'BRBY.L', 'BRW.L', 'BT.L', 'BVIC.L', 'BVS.L', 'CAL.L', 'CAPCC.L', 'CARDC.L', 'CCC.L', 'CCH.L', 'CEY.L', 'CINE.L', 'CKN.L', 'CLSH.L', 'CNA.L', 'COB.L', 'COSG.L', 'CPG.L', 'CPI.L', 'CRDA.L', 'CRH.L', 'CTEC.L', 'CWK.L', 'DC.L', 'DCC.L', 'DGE.L', 'DIAL.L', 'DOM.L', 'DPH.L', 'DRX.L', 'DSCV.L', 'DTY.L', 'DVO.L', 'ELM.L', 'EMG.L', 'EQN.L', 'ERM.L', 'ESNT.L', 'ETO.L', 'EVRE.L', 'EXPN.L', 'EZJ.L', 'FLTRF.L', 'FORT.L', 'FOUR.L', 'FRES.L', 'FSJ.L', 'FSTA.L', 'FUTR.L', 'GFS.L', 'GFTU_u.L', 'GHGG.L', 'GNC.L', 'GNK.L', 'GNS.L', 'GOCO.L', 'GPOR.L', 'GRG.L', 'GSK.L', 'GVC.L', 'GYM.L', 'HFD.L', 'HIK.L', 'HILS.L', 'HLMA.L', 'HSW.L', 'HTG.L', 'HWDN.L', 'IBST.L', 'ICAG.L', 'IGG.L', 'III.L', 'IMB.L', 'IMI.L', 'INCH.L', 'INF.L', 'IPF.L', 'ISA.L', 'ITRK.L', '

In [5]:
jitter_bug = lambda n: (0.01) * np.random.random_sample(size=n) # Random floats in [0, 0.01)

series_minmax = lambda ps: pd.Series(MinMaxScaler().fit_transform(ps.values.reshape(-1, 1)).T[0], index=ps.index)

def flatten_list(list_):
    flat_l = []
    for i in range(len(list_)):
        flat_l += list_[i]
    return flat_l

def densify(df_s):                                 
    return pd.DataFrame({k:v for k,v in zip(df_s.columns, df_s.T.values)}, index=df_s.index)

def portfolio_content_recs(portfolio, results, n=10):
    # Get appropriate results
    p_result=[]
    
    for tidm in portfolio:
        try:
            p_result.append(results[tidm][1:50]) # To 50 is a search/speed trade off
        except KeyError:
            continue # No similarity results for this ticker

    # Flatten list of lists
    p_result=flatten_list(p_result)

    # Remove anything already in the portfolio arg
    # (built as a new list: removing from a list while iterating over it skips items)
    p_result = [(score, tidm) for score, tidm in p_result if tidm not in portfolio]
        
    # Dataframe of recs
    t_=pd.DataFrame({n:(x[0],x[1]) for n,x in enumerate(p_result)}).T.rename(columns={0:'rec',1:'Instrument'})

    # Groupby so any duplicates are summed
    return t_.groupby('Instrument').rec.sum().sort_values(ascending=False)[:n]

# Core algorithm
# Inputs are: Investor OR holdings
# Outputs are: n recommendations from each applicable methodology

def recommend(investor = None, holdings = None, single_list = True, n = 10):
    if investor is None and holdings is None:
        return
    
    # Collaborative filtering
    
    try:
        collab_recs = df_colab_recs.loc[investor, :].sort_values(ascending = False)[:n]
        collab_recs_scaled = series_minmax(collab_recs) + jitter_bug(len(collab_recs))
    except Exception: # e.g. KeyError when the investor is not in the training data
        print('Not using collaborative filtering')
        collab_recs_scaled = None
    
    # Classification probabilities
    
    if investor in class_probs:
        best_model = class_probs[investor]['stats'].loc['Precision','Best']
        preds = class_probs[investor]['preds'][best_model] # Take all classifier predictions
        class_pred_recs = preds[preds.hold == 0].p.sort_values(ascending = False)[:n] # Narrow down to top n (most likely) and unheld
        class_pred_recs_scaled = series_minmax(class_pred_recs) + jitter_bug(len(class_pred_recs))
    else:
        print('Not using classification probabilities')
        class_pred_recs_scaled = None
        
    # Content filtering
    
    if holdings is None: # Relying on investor
        try:
            portfolio = df_cf_holdings.loc[investor, :].dropna().index.to_list()
            content_recs = portfolio_content_recs(portfolio, cont_filt_sims, n)
            content_recs_scaled = series_minmax(content_recs) + jitter_bug(len(content_recs))
        except Exception: # e.g. investor not in the holdings data, or no similar companies found
            print('Not using content filtering')
            content_recs_scaled = None
    else:
        portfolio = holdings
        try:
            content_recs = portfolio_content_recs(portfolio, cont_filt_sims, n)
            content_recs_scaled = series_minmax(content_recs) + jitter_bug(len(content_recs))
        except Exception: # e.g. no similar companies found for this portfolio
            print('Not using content filtering')
            content_recs_scaled = None

    if (single_list): 
        return pd.concat([collab_recs_scaled, class_pred_recs_scaled, content_recs_scaled], axis = 0).sort_values(ascending = False)[:n]
    else: 
        return collab_recs_scaled, class_pred_recs_scaled, content_recs_scaled
    
# Wrapper helper function

def recs_friendly_wrapper(investor = None, holdings = None):
    t_ = recommend(investor, holdings).groupby(level=0).sum().sort_values(ascending=False) # groupby sums any duplicates

    print('\n')
    for ticker, score in t_.items(): # Series.iteritems() was removed in pandas 2.0
        print('{} {} {:.2f}'.format(ticker, get_name(ticker, df_sec), score))
    return

In [6]:
df_cf_holdings = densify(df_cf_holdings)
df_cf_holdings = df_cf_holdings.replace(0,np.nan)

In [7]:
# Example of three separate lists
i_ = 'Investec Asset Management Ltd.'
t_ = recommend(i_, None, single_list = False)

print('All three lists for', i_)

print('\nCollaborative filtering recs\n---\n\n', t_[0])
print('\nClass prediction recs\n---\n\n', t_[1])
print('\nContent filtering recs\n---\n\n', t_[2])

All three lists for Investec Asset Management Ltd.

Collaborative filtering recs
---

 KMR.L      1.007
STUS.L     0.943
NXR.L      0.866
RESIR.L    0.703
CNCTC.L    0.490
EPICE.L    0.307
BIFF.L     0.270
RECIV.L    0.220
AEWU.L     0.059
BALF.L     0.004
dtype: float64

Class prediction recs
---

 ITRK.L    1.001
SGRO.L    0.920
GFS.L     0.726
JD.L      0.406
GPOR.L    0.297
CRH.L     0.223
BAES.L    0.157
SAFE.L    0.073
WEIR.L    0.069
IWG.L     0.004
dtype: float64

Content filtering recs
---

 Instrument
GOCO.L    1.003e+00
PFC.L     8.477e-01
DIAL.L    6.254e-01
KMR.L     5.740e-01
FRES.L    4.985e-01
EVRE.L    4.408e-01
GNK.L     3.876e-01
SRP.L     2.153e-01
NMC.L     5.720e-02
PSN.L     6.515e-04
dtype: float64


In [8]:
# Combined list based on collaborative filtering and content filtering lists

i_ = 'California State Teachers Retirement System'
print('Combined list based on collaborative filtering and content filtering lists for', i_, '\n')
recs_friendly_wrapper(i_, None)

Combined list based on collaborative filtering and content filtering lists for California State Teachers Retirement System 

Not using classification probabilities


RESIR.L Residential Secure Income PLC 1.01
FRES.L Fresnillo PLC 1.00
DIAL.L Dialight PLC 0.84
AEWU.L AEW UK REIT PLC 0.82
BVC.L Batm Advanced Communications Ltd 0.81
KMR.L Kenmare Resources PLC 0.80
EPICE.L Ediston Property Investment Company PLC 0.64
SERE.L Schroder European Real Estate Investment Trust PLC 0.38
SOHO.L Triple Point Social Housing REIT PLC 0.30
APTD.L Aptitude Software Group PLC 0.27


In [9]:
# Combined list based on all three lists
i_ = 'BlackRock Advisors (UK) Limited'
print('Combined list based on all three lists for', i_)
recs_friendly_wrapper(i_, None)

Combined list based on all three lists for BlackRock Advisors (UK) Limited


SDR.L Schroders PLC 1.61
DIAL.L Dialight PLC 1.01
WEIR.L Weir Group PLC 1.01
SERE.L Schroder European Real Estate Investment Trust PLC 1.00
RESIR.L Residential Secure Income PLC 0.68
AEWU.L AEW UK REIT PLC 0.64
DPEU.L DP Eurasia NV 0.38
KMR.L Kenmare Resources PLC 0.31
SRP.L Serco Group PLC 0.29


In [10]:
# Content filtering list based on portfolio only (no knowledge of investor)

print('Content filtering list based on portfolio only (no knowledge of investor)\n')

recs_friendly_wrapper(None, ['WPP.L', 'ITV.L'])

Content filtering list based on portfolio only (no knowledge of investor)

Not using collaborative filtering
Not using classification probabilities


PSON.L Pearson PLC 1.01
FOUR.L 4imprint Group PLC 0.57
BKGH.L Berkeley Group Holdings PLC 0.51
GNK.L Greene King PLC 0.38
INF.L Informa PLC 0.23
EVRE.L EVRAZ plc 0.18
CARDC.L Card Factory PLC 0.11
BT.L BT Group PLC 0.05
MKS.L Marks and Spencer Group PLC 0.02
ETO.L Entertainment One Ltd 0.00
