# ClusterFinTwit
## Grouping financial Twitter accounts using topic analysis 
(and, failing that, by any means necessary)

### Druce Vertes
### Metis Data Science Bootcamp
### November 16, 2018


# Source Data:

![streeteye.png](streeteye.png)

# Source Data:
- Panel of ~600 Financial Twitter Screen Names

- Identify and follow most influential people who frequently share financial news

- Studied 250,000 URLs shared from October 2017 to mid-November 2018

- Financial journalists

- Professional pundits

- Finance pros who position themselves as experts

- Academics

# Motivation
![recommend.png](recommend.png)

# Motivation:
- Recommend people to follow on Twitter 
    - Problem: Discovering users to follow on Twitter is hard
    - Solution: Understand how often users tweet on different topics
    - Benefits: Better recommendations and clarity on why to follow someone, e.g. this person's website is AbnormalReturns.com, these are the other websites they tweet, and this is how many times a day they tweet on each topic, by percentage

- Improve ability to organize news by topic (Asia or crypto page/filter)

- Understand FinTwit structure

- Cool visuals and viral blog post

# Methodology

1) Preprocess headlines of stories they share (not actual tweets)

2) Lemmatize

3) Tokenize common ngrams

4) Model & cluster topics

5) Model & cluster users

6) Visualize

# 1. Preprocess

- Problem: 'u.k.' -> 'u', 'k' -> 💨

- Problem: 's&p 500', 's & p 500', 's. & p. 500', 's and p 500' -> 's', 'p' -> 💨

- Problem: 'the fed' -> 'feed' 💩

- Solution: 'u_k', 's_p_500', 'the_fed' 👍🏻
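
A minimal sketch of the protection step, assuming regex substitution is run before tokenizing. The patterns and the `protect_terms` helper are illustrative assumptions, not the rules actually used for the talk.

```python
import re

# Hypothetical replacement table: each pattern maps to one protected token.
PROTECTED = [
    (re.compile(r"\bu\.\s?k\.?(?!\w)"), "u_k"),
    (re.compile(r"\bs\.?\s*(?:&|and)\s*p\.?\s*500\b"), "s_p_500"),
    (re.compile(r"\bthe fed\b"), "the_fed"),
]

def protect_terms(headline: str) -> str:
    """Lowercase a headline and replace fragile terms with underscore tokens."""
    text = headline.lower()
    for pattern, token in PROTECTED:
        text = pattern.sub(token, text)
    return text

for h in ["U.K. growth slows", "S. & P. 500 falls", "The Fed holds rates"]:
    print(protect_terms(h))
```

Running the substitutions before tokenizing and lemmatizing means later steps never see the bare 'u', 'k', 's', 'p' or 'fed'.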

# 2. Lemmatize

- 'Allies' -> 'ally'
- 'Exploring' -> 'explore'
- Treat all variations of words the same for purpose of topic analysis

# 3. Tokenize ngrams (after lemmatizing)

- 'social' suggests one set of topics
- 'security' suggests another set of topics
- 'social security' suggests a third set of topics
- Combine into 'social_security'
- 300 common ngrams
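
A rough sketch of the ngram step for bigrams, assuming headlines are already lemmatized and split into tokens. The function names and the count threshold are assumptions for illustration; the slides only say that 300 common ngrams were kept.

```python
from collections import Counter

def count_bigrams(docs):
    """Count adjacent token pairs across a list of tokenized headlines."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def merge_bigrams(tokens, keep):
    """Join any adjacent pair found in `keep` into one underscore token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in keep:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

docs = [
    "social security reform stall".split(),
    "social security trust fund shrink".split(),
    "social medium stock slide".split(),
]
counts = count_bigrams(docs)
# Toy rule: keep pairs seen at least twice (an assumed cutoff).
keep = {pair for pair, n in counts.items() if n >= 2}
merged_docs = [merge_bigrams(d, keep) for d in docs]
print(merged_docs)
```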


# Ngrams
    (('associated', 'press'), 1966),
    (('attorney', 'general'), 431),
    (('bank', 'america'), 1014),
    (('bank', 'england'), 1594),
    (('barack', 'obama'), 253),
    (('basic', 'income'), 105),
    (('bbc', 'radio'), 264),
    (('ben', 'carson'), 1063),
    (('bernie', 'sander'), 5528),
    (('big', 'data'), 476),
    (('bill', 'gate'), 234),
    (('bond', 'market'), 1200),
    (('border', 'wall'), 112),
    (('central', 'bank'), 6407),
    (('charge', 'with'), 251),
    (('chief', 'executive'), 3278),
    (('chris', 'christie'), 1023),
    (('climate', 'change'), 2769),
    (('climate', 'deal'), 112),
    (('credit', 'card'), 1356),
    (('credit', 'suisse'), 1345),

# Result: Corpus

    China’s Xi Jinping hits out at ‘law of the jungle’ trade policies
    UK business leaders call for ‘people’s vote’ on Brexit deal
    Theresa May to warn pro-Brexit ministers time is running out
    MGM Casino exploring Caesars merger: sources
    China Seeks Allies as Trump’s Trade War Mounts. It Won’t Be Easy.
    Wealthy Americans Assure Populace That Heavily Armed Floating City Being Built Above Nation Has Nothing To Do With Anything
    Fewer Stars to Rise at Goldman Sachs as Partnership Class Shrinks
    S&P 500 Earnings Season Update: November 2, 2018
    
    china xi jinping hit law jungle trade policy
    uk business leader call people vote brexit deal
    theresa_may warn pro brexit minister time run
    mgm casino explore caesar merger source
    china seek ally trump trade_war mount win easy
    wealthy american assure populace heavily arm float city build nation nothing anything
    star rise goldman_sachs partnership class shrink
    s_p_500 earnings season update november 2018


# 4. Topic Analysis

- Train 80-topic models on the corpus
    - LSI (Latent Semantic Indexing)
    - NMF (Non-negative Matrix Factorization)
    - LDA (Latent Dirichlet Allocation)
        - rejected - slow, poor results on short docs
- Word2Vec (300-dimensional vectors)
    - Initialized with pretrained Google News vectors
    - Trained further using the headline corpus
    - Average vectors for each headline

## Start Word2vec training
![w2v1.png](w2v1.png)

## End Word2vec training
![w2v2.png](w2v2.png)

In [30]:
import pickle
import random

from sklearn.manifold import TSNE

# Offline plotly: render figures inline in the notebook
from plotly.offline import init_notebook_mode, iplot
from plotly.graph_objs import Figure, Layout, Scatter
init_notebook_mode(connected=True)

# Load the token -> index map, the index -> token map, and the embedding matrix
with open("embeddings.pkl", "rb") as f:
    embed_dict, embed_reverse, embeddings = pickle.load(f)

chart_width=720
chart_height=580

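def do_tsne(doc_vectors, perplexity=5):
    """Project high-dimensional vectors to 2-D with t-SNE for plotting."""
    tsne = TSNE(perplexity=perplexity,
                n_components=2,
                init='pca',
                n_iter=5000)  # renamed max_iter in scikit-learn 1.5+

    two_d_vectors = tsne.fit_transform(doc_vectors)
    return two_d_vectors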

def headline_scatterplot(vectors, labels, clusters=None, title="", justlabels=False):
    """Scatter 2-D vectors with hover labels; color points by cluster if given."""
    if clusters is not None:
        clusterdict=dict(
            size=5,
            color = clusters,
            colorscale='Jet',
            showscale=True
        )
    else:
        clusterdict=dict(
            size=5,
            showscale=False
        )

    if justlabels:
        mode = 'text'
        hoverlabel = dict(
            bgcolor = '#FF99CC'
        )
    else:
        mode = 'markers'
        hoverlabel = dict(
            bgcolor = '#CCCCCC',
            font = dict(
                family='serif',
                size=16,
                color='#000000'
            )
        )

    trace = Scatter(
        x = vectors[:,0],
        y = vectors[:,1],
        text = labels,
        hoverinfo = 'text',
        hoverlabel= hoverlabel,
        mode = mode,
        textfont = dict(
            family='sans serif',
            size=10,
            color='#1f77b4'       
        ),
        marker=clusterdict
    )
    data = [trace]
    
    layout = Layout(
        title = title,
        height=chart_height,
        width=chart_width,
#        margin=dict(b=kwargs['bottom_margin']),
    )

    # Plot and embed in ipython notebook!
    iplot(Figure(data=data,layout=layout), filename='basic-scatter')

embedrows, embedcols = embeddings.shape
# 99 most popular tokens (indices 1-99, skipping index 0) + 150 random others
randlist = list(range(1,100)) + [i+100 for i in sorted(random.sample(range(embedrows-100),150))]
randtokens = [embed_reverse[r] for r in randlist]


In [28]:
headline_scatterplot(do_tsne(embeddings[randlist]), randtokens, title="Word Vectors - TSNE", justlabels=True)

# 4. Topic Analysis

- Concatenate document vectors from all 3 (LSI, NMF, average Word2vec by document): 80 + 80 + 300 = 460 columns
- PCA to 80 columns
- Plot topics with TSNE
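
The concatenate-then-reduce step can be sketched with plain NumPy, assuming PCA via SVD on mean-centered columns. The random matrices are stand-ins for the real document vectors, and whether the columns were rescaled before PCA is not stated in the slides, so none is applied here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs = 200  # toy size standing in for the headline corpus

# Random stand-ins for the three per-headline representations.
lsi_vecs = rng.normal(size=(n_docs, 80))
nmf_vecs = rng.random(size=(n_docs, 80))
w2v_vecs = rng.normal(size=(n_docs, 300))

combined = np.hstack([lsi_vecs, nmf_vecs, w2v_vecs])  # 80 + 80 + 300 columns

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components via SVD."""
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

reduced = pca_reduce(combined, 80)
print(combined.shape, reduced.shape)
```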


In [31]:
with open("docvecs.pkl", "rb") as f:
    reduced_data, ohs, plotclusters = pickle.load(f)
headline_scatterplot(do_tsne(reduced_data, perplexity=5), ohs, title="Headline Vectors - TSNE")


# 5. Cluster Analysis - Headlines
- Some clustering
- But if a paper had sections like this I'd throw it out

# 5. Cluster Analysis - Headlines

![Silhouette1.png](Silhouette1.png)

# 5. Cluster Analysis - Headlines

![Silhouette2.png](Silhouette2.png)
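
For reference, a silhouette comparison across cluster counts might look like the sketch below. K-means is an assumption (the slides do not name the clustering algorithm), and the three synthetic blobs stand in for the reduced headline vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs instead of real headline vectors.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette: near 1 for tight, well-separated clusters, near 0 for overlap.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real headline vectors the scores are typically much lower than on clean blobs like these, which is consistent with the "some clustering" verdict above.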

# 5. Cluster Analysis - Users

- For each FinTweep, average the vectors across all headlines they shared

In [4]:
with open("tweep_by_topic.pkl", "rb") as f:
    vecs, labels, clusters = pickle.load(f)
headline_scatterplot(do_tsne(vecs, perplexity=5), labels, clusters=clusters, title="User Topic Vectors - TSNE")
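
The per-user averaging described above can be sketched as follows. The `shares` mapping from screen name to headline row indices, and the tiny 2-column `doc_vecs`, are hypothetical inputs for illustration.

```python
import numpy as np

def user_topic_vectors(doc_vecs, shares):
    """Average headline vectors per user.

    shares maps a screen name to the row indices of doc_vecs that user shared.
    """
    names = sorted(shares)
    vecs = np.vstack([doc_vecs[shares[name]].mean(axis=0) for name in names])
    return names, vecs

doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 1.0]])
shares = {"user_a": [0, 1], "user_b": [2, 3], "user_c": [1]}
names, user_vecs = user_topic_vectors(doc_vecs, shares)
print(names)
print(user_vecs)
```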



# 5. Cluster Analysis

- But topics aren't all we have
- In many cases FinTweeps shared the same actual URL
- The more URLs they have in common, the more likely they are similar
    - For ReformedBroker, we can say, he shared 1000 unique URLs
    - AbsoluteReturn shared 800 unique URLs
    - 40 URLs were shared by both
    - ReformedBroker overlapped with AbsoluteReturn on 4% of his shares
    - Conversely AbsoluteReturn overlapped with ReformedBroker on 5% of his shares
    - Create 500 x 500 matrix of % of URLs shared with each other (coshare matrix)
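
A minimal sketch of the coshare matrix, assuming each user's shares are held as a set of URLs. It reproduces the example above (1000 and 800 unique URLs, 40 in common); the URL strings are placeholders, and leaving the diagonal at zero is an assumption.

```python
import numpy as np

def coshare_matrix(urls_by_user):
    """m[i, j] = fraction of user i's unique URLs that user j also shared."""
    names = sorted(urls_by_user)
    n = len(names)
    m = np.zeros((n, n))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i != j and urls_by_user[a]:
                common = urls_by_user[a] & urls_by_user[b]
                m[i, j] = len(common) / len(urls_by_user[a])
    return names, m

shared = {f"shared/{i}" for i in range(40)}
urls_by_user = {
    "ReformedBroker": shared | {f"rb/{i}" for i in range(960)},
    "AbsoluteReturn": shared | {f"ar/{i}" for i in range(760)},
}
names, m = coshare_matrix(urls_by_user)
rb, ar = names.index("ReformedBroker"), names.index("AbsoluteReturn")
print(m[rb, ar], m[ar, rb])
```

The matrix is not symmetric: each row is normalized by that user's own share count, which is why the two users see 4% and 5% for the same 40 URLs.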


# 5. Cluster Analysis

But wait, there's more!
- We have even more info
- AbsoluteReturn and ReformedBroker follow each other
- Build a followers matrix of who follows whom
- Do PCA on all 3 matrices
    - Topics: 500x80 matrix of reduced topics by user
    - Coshares: 500x500 coshares matrix
    - Followers: 500x500 followers matrix
    - PCA and chart with TSNE

In [5]:
with open("tweepvecs.pkl", "rb") as f:
    allvecs, ohs, clusters = pickle.load(f)
headline_scatterplot(do_tsne(allvecs, perplexity=5), ohs, clusters=clusters, title="User Vectors - TSNE")
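
One way the three user matrices might be combined is sketched below. Reducing each block separately with PCA and then concatenating is an assumption (the slides only say "PCA on all 3 matrices"); the sizes, the random data, and the `follows` edge list are toy stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_users = 50  # toy size; the slides describe 500 users

topics = rng.normal(size=(n_users, 8))          # stand-in for the 500x80 topic matrix
coshares = rng.random(size=(n_users, n_users))  # stand-in for the coshare matrix
np.fill_diagonal(coshares, 0.0)

# followers[i, j] = 1 if user i follows user j (hypothetical edges).
follows = [(0, 1), (1, 0), (2, 0)]
followers = np.zeros((n_users, n_users))
for i, j in follows:
    followers[i, j] = 1.0

def reduce_block(X, n_components):
    """PCA-reduce one matrix so no single block dominates by column count."""
    return PCA(n_components=n_components, random_state=0).fit_transform(X)

user_vecs = np.hstack([
    reduce_block(topics, 5),
    reduce_block(coshares, 5),
    reduce_block(followers, 2),
])
print(user_vecs.shape)
```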


# 6. Next Steps

[Fancier graph](http://dig-eh.org/dig-eh/TopicModelling/CircularDisciplines/)

# Conclusion
1. Discovery is hard
2. Navigating relationships in social media is hard
3. Trilemma
    - Topic
    - Influence
    - Time
    

# Million Dollar Kickstarter:

## Footpedals and Helmet to navigate the FinTwitterSphere 
(and StackOverflow, Reddit)

<div><img src="helmet.jpg" width="340" align="left"><img src="pedals.jpg" width="300" align="right"></div>

    

# Questions?