
<br>
=======================================<br>
Visualizing the stock market structure<br>
=======================================<br>
This example employs several unsupervised learning techniques to extract<br>
the stock market structure from variations in historical quotes.<br>
The quantity that we use is the daily variation in quote price: quotes<br>
that are linked tend to fluctuate together during a day.<br>
.. _stock_market:<br>
Learning a graph structure<br>
--------------------------<br>
We use sparse inverse covariance estimation to find which quotes are<br>
correlated conditionally on the others. Specifically, sparse inverse<br>
covariance gives us a graph, that is, a list of connections. For each<br>
symbol, the symbols that it is connected to are those useful to explain<br>
its fluctuations.<br>
Clustering<br>
----------<br>
We use clustering to group together quotes that behave similarly. Here,<br>
amongst the :ref:`various clustering techniques <clustering>` available<br>
in scikit-learn, we use :ref:`affinity_propagation` as it does<br>
not enforce equal-size clusters, and it can automatically choose the<br>
number of clusters from the data.<br>
Note that this gives us a different indication than the graph, as the<br>
graph reflects conditional relations between variables, while the<br>
clustering reflects marginal properties: variables clustered together can<br>
be considered as having a similar impact at the level of the full stock<br>
market.<br>
Embedding in 2D space<br>
---------------------<br>
For visualization purposes, we need to lay out the different symbols on a<br>
2D canvas. For this we use :ref:`manifold` techniques to retrieve a 2D<br>
embedding.<br>
Visualization<br>
-------------<br>
The outputs of the 3 models are combined in a 2D graph where nodes<br>
represent the stocks and edges the links between them:<br>
- cluster labels are used to define the color of the nodes<br>
- the sparse covariance model is used to display the strength of the edges<br>
- the 2D embedding is used to position the nodes in the plane<br>
This example has a fair amount of visualization-related code, as<br>
visualization is crucial here to display the graph. One of the challenges<br>
is to position the labels minimizing overlap. For this we use a<br>
heuristic based on the direction of the nearest neighbor along each<br>
axis.<br>


Author: Gael Varoquaux gael.varoquaux@normalesup.org<br>
License: BSD 3 clause

In [None]:
import sys

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

In [None]:
import pandas as pd

In [None]:
from sklearn import cluster, covariance, manifold

In [None]:
print(__doc__)

#############################################################################<br>
Retrieve the data from the Internet

The data is from 2003 - 2008. This is a reasonably calm period: not too long ago, so<br>
that we get high-tech firms, and before the 2008 crash. This kind of<br>
historical data can be obtained from APIs such as those of quandl.com and<br>
alphavantage.co.

In [None]:
symbol_dict = {
    'TOT': 'Total',
    'XOM': 'Exxon',
    'CVX': 'Chevron',
    'COP': 'ConocoPhillips',
    'VLO': 'Valero Energy',
    'MSFT': 'Microsoft',
    'IBM': 'IBM',
    'TWX': 'Time Warner',
    'CMCSA': 'Comcast',
    'CVC': 'Cablevision',
    'YHOO': 'Yahoo',
    'DELL': 'Dell',
    'HPQ': 'HP',
    'AMZN': 'Amazon',
    'TM': 'Toyota',
    'CAJ': 'Canon',
    'SNE': 'Sony',
    'F': 'Ford',
    'HMC': 'Honda',
    'NAV': 'Navistar',
    'NOC': 'Northrop Grumman',
    'BA': 'Boeing',
    'KO': 'Coca Cola',
    'MMM': '3M',
    'MCD': 'McDonald\'s',
    'PEP': 'Pepsi',
    'K': 'Kellogg',
    'UN': 'Unilever',
    'MAR': 'Marriott',
    'PG': 'Procter Gamble',
    'CL': 'Colgate-Palmolive',
    'GE': 'General Electric',
    'WFC': 'Wells Fargo',
    'JPM': 'JPMorgan Chase',
    'AIG': 'AIG',
    'AXP': 'American Express',
    'BAC': 'Bank of America',
    'GS': 'Goldman Sachs',
    'AAPL': 'Apple',
    'SAP': 'SAP',
    'CSCO': 'Cisco',
    'TXN': 'Texas Instruments',
    'XRX': 'Xerox',
    'WMT': 'Wal-Mart',
    'HD': 'Home Depot',
    'GSK': 'GlaxoSmithKline',
    'PFE': 'Pfizer',
    'SNY': 'Sanofi-Aventis',
    'NVS': 'Novartis',
    'KMB': 'Kimberly-Clark',
    'R': 'Ryder',
    'GD': 'General Dynamics',
    'RTN': 'Raytheon',
    'CVS': 'CVS',
    'CAT': 'Caterpillar',
    'DD': 'DuPont de Nemours'}

In [None]:
symbols, names = np.array(sorted(symbol_dict.items())).T
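
The line above sorts the dictionary by ticker and transposes the resulting array, so that tickers and company names end up in two aligned arrays. A minimal sketch on a three-entry dictionary (the `toy_*` names are only illustrative and not part of the example):

```python
import numpy as np

toy_dict = {'XOM': 'Exxon', 'AAPL': 'Apple', 'IBM': 'IBM'}

# sorted() orders the (ticker, name) pairs by ticker; the array has shape
# (n_symbols, 2), so its transpose unpacks into two aligned 1D arrays.
toy_symbols, toy_names = np.array(sorted(toy_dict.items())).T

print(toy_symbols)  # ['AAPL' 'IBM' 'XOM']
print(toy_names)    # ['Apple' 'IBM' 'Exxon']
```

Because both arrays share the same order, a boolean mask computed on one of them (for instance on cluster labels of the same length) can be used to index the other.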

In [None]:
quotes = []

In [None]:
for symbol in symbols:
    print('Fetching quote history for %r' % symbol, file=sys.stderr)
    url = ('https://raw.githubusercontent.com/scikit-learn/examples-data/'
           'master/financial-data/{}.csv')
    quotes.append(pd.read_csv(url.format(symbol)))

In [None]:
close_prices = np.vstack([q['close'] for q in quotes])
open_prices = np.vstack([q['open'] for q in quotes])

The daily variations of the quotes are what carry the most information.

In [None]:
variation = close_prices - open_prices

#############################################################################<br>
Learn a graphical structure from the correlations

In [None]:
edge_model = covariance.GraphicalLassoCV()

Standardize the time series: using correlations rather than covariance<br>
is more efficient for structure recovery.
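
Dividing each series by its standard deviation means that the covariance of the scaled data is the correlation matrix of the raw data. The sketch below checks this on made-up data (the `toy_*` arrays and their scales are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical data: 200 observations of 3 series with very different scales.
toy_X = rng.randn(200, 3) * np.array([1.0, 10.0, 100.0])
toy_X[:, 1] += 5.0 * toy_X[:, 0]  # make two of the series correlated

# Same scaling as in the example: divide each column by its standard deviation.
toy_X_std = toy_X / toy_X.std(axis=0)

# bias=True normalizes by N, matching the default ddof=0 of ndarray.std.
cov_of_scaled = np.cov(toy_X_std, rowvar=False, bias=True)
corr_of_raw = np.corrcoef(toy_X, rowvar=False)
print(np.allclose(cov_of_scaled, corr_of_raw))  # True
```

Without this step, a series with a large price scale would dominate the covariance and therefore the estimated graph.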

In [None]:
X = variation.copy().T
X /= X.std(axis=0)
edge_model.fit(X)
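
To get a feel for what the fitted model contains, here is a small sketch on synthetic data where the true graph is known. It assumes a chain of four variables and uses `GraphicalLasso` with a hand-picked `alpha` rather than the cross-validated estimator of the example; the `toy_*` names and the value of `alpha` are illustrative only.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Assumed ground truth: a chain 0 - 1 - 2 - 3, encoded as a tridiagonal
# precision (inverse covariance) matrix.
true_precision = np.array([[2.0, 0.8, 0.0, 0.0],
                           [0.8, 2.0, 0.8, 0.0],
                           [0.0, 0.8, 2.0, 0.8],
                           [0.0, 0.0, 0.8, 2.0]])
true_covariance = np.linalg.inv(true_precision)

rng = np.random.RandomState(0)
toy_X = rng.multivariate_normal(np.zeros(4), true_covariance, size=2000)

# alpha is fixed by hand here; the example lets GraphicalLassoCV pick it.
toy_model = GraphicalLasso(alpha=0.05).fit(toy_X)

# Entries close to zero in precision_ correspond to absent edges.
print(np.round(toy_model.precision_, 2))
```

In the printed matrix, the entries linking neighbors in the chain should stand out, while entries for variables that are only indirectly related should be close to zero.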

#############################################################################<br>
Cluster using affinity propagation

In [None]:
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()

In [None]:
for i in range(n_labels + 1):
    print('Cluster %i: %s' % ((i + 1), ', '.join(names[labels == i])))
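
The number of clusters printed above is not set anywhere: affinity propagation derives it from the data. A minimal sketch of this behavior, assuming made-up 2D points rather than a covariance matrix and using the `AffinityPropagation` estimator class (the `random_state` parameter requires a reasonably recent scikit-learn):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.RandomState(0)
# Hypothetical data: two groups of unequal size (30 and 10 points).
toy_X = np.vstack([rng.randn(30, 2), rng.randn(10, 2) + 10.0])

# No number of clusters is passed; it follows from the data and from the
# `preference` parameter (left at its default here).
toy_ap = AffinityPropagation(random_state=0).fit(toy_X)

toy_cluster_ids, toy_cluster_sizes = np.unique(toy_ap.labels_,
                                               return_counts=True)
print(toy_cluster_ids, toy_cluster_sizes)
```

The cluster sizes are free to differ, which is the property the example relies on when grouping stocks.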

#############################################################################<br>
Find a low-dimensional embedding for visualization: find the best position of<br>
the nodes (the stocks) on a 2D plane

We use a dense eigen_solver to achieve reproducibility (arpack is<br>
initialized with random vectors that we don't control). In addition, we<br>
use a large number of neighbors to capture the large-scale structure.

In [None]:
node_position_model = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver='dense', n_neighbors=6)

In [None]:
embedding = node_position_model.fit_transform(X.T).T

#############################################################################<br>
Visualization

In [None]:
plt.figure(1, facecolor='w', figsize=(10, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])
plt.axis('off')

Display a graph of the partial correlations

In [None]:
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)
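
The normalization above rescales the precision matrix so that its diagonal is 1, which makes the off-diagonal entries comparable across stocks. Strictly speaking, a partial correlation is the negative of the normalized off-diagonal entry, but only absolute values are used for the edges, so the sign does not matter here. A small sketch on a hand-written precision matrix (the `toy_*` names are illustrative):

```python
import numpy as np

toy_precision = np.array([[2.0, 0.8, 0.0],
                          [0.8, 2.0, 0.8],
                          [0.0, 0.8, 2.0]])

# Scale rows and columns by 1 / sqrt(diagonal), as in the cell above.
toy_d = 1 / np.sqrt(np.diag(toy_precision))
toy_partial = toy_precision * toy_d * toy_d[:, np.newaxis]

# Keep only the upper triangle so that each edge is listed once.
toy_edges = np.abs(np.triu(toy_partial, k=1)) > 0.02
print(np.round(toy_partial, 2))
print(np.argwhere(toy_edges))
```

Here the two retained edges are (0, 1) and (1, 2); the pair (0, 2) has a zero entry in the precision matrix and is therefore not drawn.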

Plot the nodes using the coordinates of our embedding

In [None]:
plt.scatter(embedding[0], embedding[1], s=100 * d ** 2, c=labels,
            cmap=plt.cm.nipy_spectral)

Plot the edges

In [None]:
start_idx, end_idx = np.where(non_zero)
# LineCollection takes a sequence of lines (line0, line1, ...), where each
# line is a sequence of points: line_n = (x0, y0), (x1, y1), ... (xm, ym)
segments = [[embedding[:, start], embedding[:, stop]]
            for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])
lc = LineCollection(segments,
                    zorder=0, cmap=plt.cm.hot_r,
                    norm=plt.Normalize(0, .7 * values.max()))
lc.set_array(values)
lc.set_linewidths(15 * values)
ax.add_collection(lc)

Add a label to each node. The challenge here is that we want to<br>
position the labels to avoid overlap with other labels.

In [None]:
for index, (name, label, (x, y)) in enumerate(
        zip(names, labels, embedding.T)):
    dx = x - embedding[0]
    dx[index] = 1
    dy = y - embedding[1]
    dy[index] = 1
    this_dx = dx[np.argmin(np.abs(dy))]
    this_dy = dy[np.argmin(np.abs(dx))]
    if this_dx > 0:
        horizontalalignment = 'left'
        x = x + .002
    else:
        horizontalalignment = 'right'
        x = x - .002
    if this_dy > 0:
        verticalalignment = 'bottom'
        y = y + .002
    else:
        verticalalignment = 'top'
        y = y - .002
    plt.text(x, y, name, size=10,
             horizontalalignment=horizontalalignment,
             verticalalignment=verticalalignment,
             bbox=dict(facecolor='w',
                       edgecolor=plt.cm.nipy_spectral(label / float(n_labels)),
                       alpha=.6))
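
The placement rule used in the loop above can be isolated in a small function. This is only a sketch: `label_alignment` is a hypothetical helper, not part of the example, and the three toy points are made up. For each node it looks at the neighbor closest along the y axis to decide the horizontal alignment, and at the neighbor closest along the x axis to decide the vertical alignment, so that the label is pushed away from them.

```python
import numpy as np


def label_alignment(points, index):
    """Return (horizontalalignment, verticalalignment) for one point.

    Hypothetical helper: `points` has shape (2, n_points), like `embedding`.
    """
    x, y = points[:, index]
    dx = x - points[0]
    dy = y - points[1]
    # Exclude the point itself from the nearest-neighbor search.
    dx[index] = 1
    dy[index] = 1
    # Horizontal offset to the neighbor closest along the y axis, and
    # vertical offset to the neighbor closest along the x axis.
    this_dx = dx[np.argmin(np.abs(dy))]
    this_dy = dy[np.argmin(np.abs(dx))]
    ha = 'left' if this_dx > 0 else 'right'
    va = 'bottom' if this_dy > 0 else 'top'
    return ha, va


toy_points = np.array([[0.0, 1.0, 0.1],
                       [0.0, 0.1, 1.0]])
print(label_alignment(toy_points, 0))  # ('right', 'top')
```

For the first point, the neighbor at nearly the same height lies to its right, so the label is right-aligned (it extends to the left), and the neighbor at nearly the same abscissa lies above, so the label is top-aligned (it extends downward).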

In [None]:
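# np.ptp (peak to peak) is used instead of the ndarray.ptp method, which was
# removed in NumPy 2.0.
plt.xlim(embedding[0].min() - .15 * np.ptp(embedding[0]),
         embedding[0].max() + .10 * np.ptp(embedding[0]))
plt.ylim(embedding[1].min() - .03 * np.ptp(embedding[1]),
         embedding[1].max() + .03 * np.ptp(embedding[1]))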

In [None]:
plt.show()