# Gorbechov Phase 3

Basic premise: this notebook picks up after the pipeline has produced final.csv, which contains all of the stock and blog data.

In [4]:
import os
os.path.exists("final.csv")

True

In [38]:
import matplotlib.pyplot as plt

In [6]:
import fastai
print(fastai)

<module 'fastai' from '/usr/local/lib/python3.6/dist-packages/fastai/__init__.py'>


In [7]:
%load_ext autoreload
%autoreload 2

%matplotlib inline


In [10]:
from fastai.imports import *
# from fastai.structured import * # Probably a different version of fastai

import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics

## Basic exploration of data

Although this is our own dataset, a quick look is a useful refresher in case we have to revisit it.

In [16]:
df_raw = pd.read_csv('final.csv', low_memory=False, delimiter='\t')

In [18]:
df_raw.columns = ['Symbol', 'Author', 'Industry', 'Sector', 'Type', 'Employees', 'Success']

In [21]:
df_raw.describe()

Unnamed: 0,Symbol,Author,Industry,Sector,Type,Employees,Success
count,25139,25139,24811.0,24811.0,25139,25139.0,25139
unique,1516,596,9.0,132.0,7,303.0,2
top,OHI,Brad Thomas,,,cs,,True
freq,460,18094,19524.0,19524.0,24222,19521.0,14877


### Some interesting notes about this data: 
    - There really aren't that many distinct stocks considered: about 1,500 unique symbols across roughly 25,000 rows.
    - About 600 different authors make up this dataset, with Brad Thomas leading the way (18,094 / 25,139).
    - Industry and Sector are kind of useless, with only about 5,000 of the 25,000 rows being something other than None.
    - Employees is also pretty useless, for the same reason.
    - Most of these are common stocks (Type `cs`).
    - The good news is that ~15,000/25,000 (about 59%) are successes, so that's at least not terrible.
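The figures in these notes can be read off `describe()`, but they can also be computed directly. A minimal sketch, assuming the column names assigned above; the tiny `toy` frame is made-up data so the snippet runs on its own, and `summarize` is a hypothetical helper name.

```python
import pandas as pd

def summarize(df):
    """Return a few headline numbers for a frame shaped like df_raw."""
    author_share = df["Author"].value_counts(normalize=True)
    return {
        "unique_symbols": df["Symbol"].nunique(),
        "top_author": author_share.index[0],
        "top_author_share": author_share.iloc[0],
        # 'None' strings and real NaNs both count as missing here
        "industry_missing": (df["Industry"].isna() | (df["Industry"] == "None")).mean(),
        "success_rate": df["Success"].mean(),
    }

# Made-up rows, only to show the shape of the result
toy = pd.DataFrame({
    "Symbol": ["O", "O", "STWD", "AAPL"],
    "Author": ["Brad Thomas", "Brad Thomas", "Brad Thomas", "Jeff Miller"],
    "Industry": ["Financial", "None", "None", "Technology"],
    "Success": [True, False, True, True],
})
print(summarize(toy))
```

On the real data this would be called as `summarize(df_raw)`.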

### Let's look at how Brad Thomas is doing

In [30]:
df_raw[df_raw.Author == "Brad Thomas"].describe()

Unnamed: 0,Symbol,Author,Industry,Sector,Type,Employees,Success
count,18094,18094,17790.0,17790.0,18094,18094.0,18094
unique,96,1,4.0,9.0,2,16.0,2
top,O,Brad Thomas,,,cs,,True
freq,385,18094,15244.0,15244.0,18088,15242.0,10552


In [31]:
10552/18094

0.5831767436719354

In [34]:
df_raw[(df_raw.Author == "Brad Thomas") & (df_raw.Symbol == "O")].describe()


Unnamed: 0,Symbol,Author,Industry,Sector,Type,Employees,Success
count,385,385,385,385,385,385,385
unique,1,1,2,2,1,2,2
top,O,Brad Thomas,Financial,REIT - Retail,cs,184,True
freq,385,385,384,384,385,384,223


In [35]:
223/385

0.5792207792207792

What does this all mean?  Basically, I'm looking at Brad Thomas and his most popular pick, "O", which is apparently a REIT with 184 or so employees.  Overall, Brad's picks hit the 5% gain target 58.3% of the time, with O slightly lower at 57.9%.
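To judge whether O's rate really differs from Brad's overall rate, a rough normal-approximation interval around O's success rate can help. This is a minimal sketch using only the counts shown above; the helper name `proportion_interval` and the 1.96 multiplier (an approximate 95% interval) are illustrative choices, not part of the original pipeline.

```python
import math

def proportion_interval(successes, trials, z=1.96):
    """Approximate confidence interval for a success rate (normal approximation)."""
    rate = successes / trials
    stderr = math.sqrt(rate * (1 - rate) / trials)
    return rate - z * stderr, rate + z * stderr

# Counts taken from the describe() outputs above
o_low, o_high = proportion_interval(223, 385)   # Brad Thomas on "O"
brad_rate = 10552 / 18094                       # Brad Thomas overall

print(f"O interval: {o_low:.3f} to {o_high:.3f}")
print(f"Brad overall: {brad_rate:.3f}")
print("Overall rate inside O interval:", o_low < brad_rate < o_high)
```

With only 385 picks, the interval is several percentage points wide, so the gap between O and Brad's overall rate is well within noise.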

## Replace the Nones with NaNs

In [45]:
df_raw.replace('None', np.nan, inplace=True)

## Fix Categorical Data
For the first go-around, I think I'll just try one-hot encoding, nothing amazing.  What I'm really looking at here is whether we can get a reasonably successful model without horribly overfitting; that is to say, is there any useful signal in this data, or is it just a coin flip?
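One-hot encoding gives every distinct value its own 0/1 column, so high-cardinality columns such as Symbol and Author produce very wide, sparse frames. The sketch below shows the mechanics on made-up rows, plus one optional way to limit the width by folding rare values into a single bucket. The helper `bucket_rare`, the `"OTHER"` label, and the `min_count` threshold are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

def bucket_rare(series, min_count):
    """Replace values seen fewer than min_count times with 'OTHER'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "OTHER")

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Symbol": ["O", "O", "O", "STWD", "STWD", "ARI"],
    "Success": [True, False, True, True, False, True],
})

plain = pd.get_dummies(toy)                       # one column per symbol
toy["Symbol"] = bucket_rare(toy["Symbol"], min_count=2)
bucketed = pd.get_dummies(toy)                    # ARI folded into OTHER

print(list(plain.columns))
print(list(bucketed.columns))
```

Bucketing trades detail for fewer columns, which may reduce the risk of memorizing symbols that appear only once or twice.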

In [62]:
df_raw.head()

Unnamed: 0,Symbol,Author,Industry,Sector,Type,Employees,Success
0,ARI,Brad Thomas,,,cs,,True
1,BXMT,Brad Thomas,,,cs,,False
2,CCI,Brad Thomas,Technology,Diversified Communication Services,cs,5000.0,True
3,CHCT,Brad Thomas,,,cs,,True
4,CLDT,Brad Thomas,,,cs,,True


In [136]:
df_encoded = pd.get_dummies(df_raw)

In [137]:
df_encoded.head()

Unnamed: 0,Success,Symbol_AAL,Symbol_AAOI,Symbol_AAON,Symbol_AAPL,Symbol_ABB,Symbol_ABBV,Symbol_ABC,Symbol_ABDC,Symbol_ABEO,...,Employees_90000,Employees_92000,Employees_92400,Employees_93000,Employees_9400,Employees_955,Employees_9700,Employees_97000,Employees_9760,Employees_98000
0,True,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,False,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,True,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,True,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,True,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [138]:
df_encoded.describe()

Unnamed: 0,Symbol_AAL,Symbol_AAOI,Symbol_AAON,Symbol_AAPL,Symbol_ABB,Symbol_ABBV,Symbol_ABC,Symbol_ABDC,Symbol_ABEO,Symbol_ABT,...,Employees_90000,Employees_92000,Employees_92400,Employees_93000,Employees_9400,Employees_955,Employees_9700,Employees_97000,Employees_9760,Employees_98000
count,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0,...,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0,25139.0
mean,4e-05,0.000199,4e-05,0.004177,0.000835,0.001273,0.000119,0.000239,0.000159,0.000278,...,0.000318,4e-05,0.000875,0.001233,4e-05,4e-05,0.000119,0.001432,0.000159,8e-05
std,0.006307,0.014102,0.006307,0.064494,0.028891,0.035656,0.010924,0.015448,0.012613,0.016685,...,0.017837,0.006307,0.02957,0.035095,0.006307,0.006307,0.010924,0.037816,0.012613,0.008919
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [71]:
labels = np.array(df_encoded["Success"])

In [73]:
labels[:5]

array([ True, False,  True,  True,  True])

In [139]:
df_encoded = df_encoded.drop('Success', axis=1)


In [76]:
df_encoded.head()

Unnamed: 0,Symbol_AAL,Symbol_AAOI,Symbol_AAON,Symbol_AAPL,Symbol_ABB,Symbol_ABBV,Symbol_ABC,Symbol_ABDC,Symbol_ABEO,Symbol_ABT,...,Employees_90000,Employees_92000,Employees_92400,Employees_93000,Employees_9400,Employees_955,Employees_9700,Employees_97000,Employees_9760,Employees_98000
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [77]:
feature_list = list(df_encoded.columns)

In [78]:
features = np.array(df_encoded)

In [79]:
# Using Scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)

In [80]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (18854, 2558)
Training Labels Shape: (18854,)
Testing Features Shape: (6285, 2558)
Testing Labels Shape: (6285,)


## Try out Random Forests

In [127]:
# Import the model we are using
from sklearn.ensemble import RandomForestClassifier
# Instantiate model with 10,000 decision trees
rf = RandomForestClassifier(n_estimators = 10000, random_state = 42, n_jobs=4)
# Train the model on training data
rf.fit(train_features, train_labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10000,
                       n_jobs=4, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

Whoops, we forgot validation data.

Also, scikit-learn doesn't use the GPU, which is tragic.

Using https://towardsdatascience.com/random-forest-in-python-24d0893d51c0 as a guide
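A possible way to add the missing validation set is to call `train_test_split` twice: once to carve off the test set, then again to split the remainder into training and validation. This is a sketch on random stand-in arrays, not the notebook's real features; the 60/20/20 proportions and the `val_*` and `rest_*` names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same layout as `features` and `labels`
rng = np.random.RandomState(42)
features = rng.randint(0, 2, size=(1000, 8))
labels = rng.rand(1000) < 0.6

# First split: hold out 20% as the final test set
rest_features, test_features, rest_labels, test_labels = train_test_split(
    features, labels, test_size=0.20, random_state=42)

# Second split: 25% of the remaining 80% gives a 20% validation set
train_features, val_features, train_labels, val_labels = train_test_split(
    rest_features, rest_labels, test_size=0.25, random_state=42)

print(train_features.shape, val_features.shape, test_features.shape)
```

Hyperparameters such as `n_estimators` would then be tuned against the validation set, leaving the test set for a single final check.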

In [128]:
import pickle

In [129]:
with open("random_forrest_3", "wb") as f:
    pickle.dump(rf, f)

In [130]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

In [131]:
predictions

array([ True,  True,  True, ...,  True,  True,  True])

In [132]:
test_labels

array([False, False,  True, ...,  True,  True,  True])

In [133]:
# Fraction of test predictions that match the true labels
(predictions == test_labels).mean()

0.6496420047732697

At roughly 65% accuracy on the test set, that's not looking promising at all.
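A useful yardstick for the accuracy figure is the score of a model that always predicts the most common label. A minimal sketch with made-up labels; `majority_baseline` and `accuracy` are hypothetical helpers, and on the real data they would be called with `test_labels` and `predictions`.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)

# Made-up example: 6 of 10 labels are True
toy_labels = [True, True, False, True, False, True, False, True, True, False]
toy_preds = [True, True, True, True, False, True, True, True, True, True]

print("baseline:", majority_baseline(toy_labels))
print("model:   ", accuracy(toy_labels, toy_preds))
```

Across the full dataset about 59% of rows are successes (14,877 of 25,139), so an always-True guess would land near that figure, and the forest's 65% is only a modest improvement over it.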

## Variable Importance

In [134]:
# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))


Variable: Symbol_STWD          Importance: 0.05
Variable: Symbol_BXMT          Importance: 0.03
Variable: Symbol_ARI           Importance: 0.02
Variable: Author_Brad Thomas   Importance: 0.02
Variable: Symbol_APTS          Importance: 0.01
Variable: Symbol_CHCT          Importance: 0.01
Variable: Symbol_CHMI          Importance: 0.01
Variable: Symbol_CONE          Importance: 0.01
Variable: Symbol_CTRE          Importance: 0.01
Variable: Symbol_KIM           Importance: 0.01
Variable: Symbol_LAND          Importance: 0.01
Variable: Symbol_RHP           Importance: 0.01
Variable: Symbol_ROIC          Importance: 0.01
Variable: Symbol_SBRA          Importance: 0.01
Variable: Symbol_TRTX          Importance: 0.01
Variable: Symbol_WPC           Importance: 0.01
Variable: Author_Dividend Sensei Importance: 0.01
Variable: Author_Dividend Sleuth Importance: 0.01
Variable: Author_Jeff Miller   Importance: 0.01
Variable: Author_Mike Nadel    Importance: 0.01
Variable: Type_cef             Impor
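Because each one-hot column gets only a sliver of importance, it can be easier to read the totals per original column. The sketch below sums importances by the prefix before the first underscore, using a handful of the rounded values printed above as sample input; `importance_by_column` is a hypothetical helper, and on the real model it would be fed `zip(feature_list, rf.feature_importances_)`.

```python
from collections import defaultdict

def importance_by_column(pairs):
    """Sum one-hot importances by source column (text before the first '_')."""
    totals = defaultdict(float)
    for name, importance in pairs:
        column = name.split("_", 1)[0]
        totals[column] += importance
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# A few of the rounded values from the output above
sample = [
    ("Symbol_STWD", 0.05),
    ("Symbol_BXMT", 0.03),
    ("Symbol_ARI", 0.02),
    ("Author_Brad Thomas", 0.02),
    ("Author_Jeff Miller", 0.01),
]

for column, total in importance_by_column(sample):
    print(f"{column:10} {total:.2f}")
```

Feeding in the unrounded importances would give a more faithful picture, since rounding to two decimals hides most of the 2,558 columns.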

# Conclusions

From our preliminary analysis, using the Seeking Alpha Editor's Picks bloggers to beat a monkey and a dartboard doesn't seem promising.  There are probably other features that could be added in (word count, writing level, etc.), but nothing I've seen so far suggests a strong signal.

Stocks are always hard since a) they have surprisingly little data (unless you're doing the intra-day or options work), and b) the price is a pretty processed signal - it's hard to find new insights, since they would be so incredibly valuable.

Some caveats:
    - I'm not looking at dividends, only stock price.  There are some games the editors may be playing, but I'm ignoring that.  I'm also not taking into account stock splits, which would be somewhat rare in the 90-day windows I'm looking at.
    - Many industries and sectors were unavailable.  However, many of the stock picks discussed were uncommon company types, e.g. REITs and ETFs.
    - It's possible that the articles were discussing long-term impacts that wouldn't be realized in 90 days, but I feel that more factors will contribute after 90 days than the articles could take into account.  In other words, the price would rise or fall due to factors that the article would have no knowledge about.
    
Overall observations:
    - The editor's picks definitely seem to have favorites.  Author Brad Thomas was a major contributor, yet didn't have a higher than average success rate (in fact, his success rate largely determined the average).
    - Certain stocks were picked often, but also didn't move the needle in terms of being successful (greater than 5% max price increase in the 90 days following the article).
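The success label used throughout (a maximum price increase of more than 5% within 90 days of the article) can be written as a small function. This is a sketch of that definition only; the function name, the argument layout, and the choice of the article-day price as the reference are assumptions, since the labeling pipeline itself is not shown here.

```python
def is_success(article_price, later_prices, threshold=0.05):
    """True if any later price exceeds the article-day price by more than threshold."""
    if not later_prices:
        return False
    max_gain = max(later_prices) / article_price - 1.0
    return max_gain > threshold

# Made-up prices for the window following an article
print(is_success(100.0, [99.0, 103.0, 106.5, 101.0]))  # peak gain of 6.5%
print(is_success(100.0, [99.0, 103.0, 104.0]))         # peak gain of 4%
```

Because the label looks at the peak rather than the closing price of the window, a pick can count as a success even if the price later falls back below the starting level.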
    
## So in answer to the original question: Are the Editor Picks of Seeking Alpha better than randomly throwing a dart at the WSJ?  A little bit, but not enough to consistently bet on.  