In [None]:
#Install the necessary packages

!pip install yfinance
!pip install matplotlib==3.5.3



In [None]:
#Standard packages
import numpy as np
import pandas as pd

#Dates
from datetime import datetime, timedelta

#Finance packages
import yfinance as yf

#Statistics
from scipy.stats import t
from scipy.stats import skew, kurtosis

#Plotting packages
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

import seaborn as sns

from matplotlib import rcParams

rcParams["font.size"] = 20
rcParams["axes.labelsize"] = 30

rcParams["xtick.labelsize"] = 16
rcParams["ytick.labelsize"] = 16

rcParams["figure.figsize"] = (8,6)

# Exercise 1. Correlation matrix - Basic analysis

The correlation matrix of $N$ random variables stores the correlation between each pair of variables. In an investment scenario, we are interested in the correlation matrix of the assets under consideration. As we have seen so far, each asset is characterized by the time series of its returns, so we will compute the correlations among the return time series.

Given $N$ assets, with their returns given by $Y_i(t_k)$  with $t_k=k\Delta t; \ k=1,\dots,n$; $i=1,\dots, N$, we define the elements of the correlation matrix as

\begin{equation}
C_{ij}=\frac{1}{n}\sum_{k=1}^n\tilde{Y}_i(t_k)\tilde{Y}_j(t_k) \quad \textrm{with} \quad \tilde{Y}_i(t_k)=\frac{Y_i(t_k)-\mu_{Y_i}}{\sigma_{Y_i}}
\end{equation}

where $\mu_{Y_i}=\left<Y_i\right>_T$ is the mean of the return time series and $\sigma_{Y_i}=\sqrt{\left<Y_i^2\right>_T-\left<Y_i\right>_T^2}$ is its standard deviation.
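As a minimal sketch of this definition, the snippet below standardizes synthetic return series and averages their products. The data, the seed and the names `Y`, `Y_tilde` and `C` are illustrative assumptions, not part of the exercise data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic returns: n observations (rows) of N assets (columns)
n, N = 500, 4
Y = rng.normal(size=(n, N))
Y[:, 1] += 0.8 * Y[:, 0]  # make assets 0 and 1 correlated

# Standardize each column with the population standard deviation (ddof=0),
# which matches sigma = sqrt(<Y^2> - <Y>^2)
Y_tilde = (Y - Y.mean(axis=0)) / Y.std(axis=0)

# C_ij = (1/n) * sum_k Y_tilde_i(t_k) * Y_tilde_j(t_k)
C = Y_tilde.T @ Y_tilde / n

print(np.round(C, 3))
```

The result should agree with `np.corrcoef(Y, rowvar=False)`, since the normalization factors cancel in the Pearson coefficient.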

Let's download stock price data for different assets and compute the correlation matrix!


**1. Download the data**

In [None]:
start = '2013-01-01'
end = datetime.today().strftime('%Y-%m-%d')

#Apple, Microsoft, Amazon, Tesla, Google, Meta(Facebook), Telefonica, Indra, IBEX35
assets = ["AAPL", "MSFT", "AMZN",  "TSLA", "GOOGL", "META", "TEF.MC", "IDR.MC", "^IBEX"]

df = yf.download(assets, start=start, end=end, progress=False)["Adj Close"]

df

Date,AAPL,AMZN,GOOGL,IDR.MC,META,MSFT,TEF.MC,TSLA,^IBEX
2013-01-02,16.837120,12.865500,18.099348,9.520262,28.000000,22.717781,4.927617,2.357333,8447.590820
2013-01-03,16.624590,12.924000,18.109859,9.547592,27.770000,22.413450,4.918208,2.318000,8403.391602
2013-01-04,16.161520,12.957500,18.467718,9.584035,28.760000,21.993977,4.929968,2.293333,8435.791016
2013-01-07,16.066454,13.423000,18.387136,9.880119,29.420000,21.952847,4.915857,2.289333,8418.991211
2013-01-08,16.109695,13.319000,18.350851,9.520262,29.059999,21.837696,4.901743,2.245333,8452.991211
...,...,...,...,...,...,...,...,...,...
2023-04-25,163.770004,102.570000,103.849998,12.260000,207.550003,275.420013,3.971000,160.669998,9290.299805
2023-04-26,163.759995,104.980003,103.709999,12.300000,209.399994,295.369995,4.067000,153.750000,9293.700195
2023-04-27,168.410004,109.820000,107.589996,12.340000,238.559998,304.829987,4.119000,160.190002,9314.599609
2023-04-28,169.679993,105.449997,107.339996,12.000000,240.320007,307.260010,4.125000,164.309998,9241.000000


**2. Compute the log returns**

In [None]:
df_returns = #CODE
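One possible way to approach this cell, shown on a toy price table instead of the downloaded data. The table `prices` and its values are made up for illustration; the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy price table standing in for the downloaded prices (hypothetical values)
prices = pd.DataFrame(
    {"A": [100.0, 101.0, 99.0, 102.0], "B": [50.0, 50.5, 51.0, 50.0]},
    index=pd.date_range("2023-01-02", periods=4, freq="B"),
)

# Log return: r(t) = ln P(t) - ln P(t-1).
# The first row has no previous price, so it is NaN and gets dropped.
log_returns = np.log(prices).diff().dropna()

print(log_returns)
```

Note that `dropna()` removes every row containing at least one missing value. With assets traded on different exchanges, this keeps only the days on which all markets were open.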


**3. Compute the correlation matrix of the returns**

- Complete the function below to compute the correlation matrix of a given DataFrame.

- Plot the results using seaborn `sns.heatmap()` function. What do you observe?

**Note:** *An efficient way of computing the correlation matrix is to use the `DataFrame.corr()` method, but it is useful to code it by hand at least once.*

In [None]:
def compute_correlation_matrix(df):

  N = len(df.columns)

  p = np.identity(N)

  for i in range(N):

    for j in range(i+1,N):

      #CODE

      p[i,j] = #CODE

      p[j,i] = p[i,j] #We already know that the correlation matrix is symmetric

  corr_mat = pd.DataFrame(p, columns=df.columns, index=df.columns)

  return corr_mat


In [None]:
corr_mat = compute_correlation_matrix(df_returns)

if np.sum(np.round(df_returns.corr().values, 4) == np.round(corr_mat.values, 4)) == len(df_returns.columns)**2:

  print("Test passed!")

else:

  print("Something went wrong...")

In [None]:
#PLOT

#CODE

**4. Compute the maximum and minimum correlations. Which pairs of stocks have the maximum and minimum correlations?**

**Clue:** *Use the np.unravel_index(idx, matrix_shape) function to transform a flattened index to a cartesian one*

In [None]:
#CODE
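A minimal sketch of the clue on a hand-made 3x3 matrix. The matrix `toy_corr` and the asset names are hypothetical. Masking the diagonal is a choice made here so that the trivial self-correlation of 1 is not reported as the maximum.

```python
import numpy as np
import pandas as pd

names = ["A", "B", "C"]
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.3],
     [-0.2, 0.3, 1.0]],
    index=names, columns=names,
)

values = toy_corr.values.copy()

# Mask the diagonal (self-correlation = 1) so it cannot win the argmax
np.fill_diagonal(values, np.nan)

# nanargmax/nanargmin return a flattened index; unravel it to (row, column)
i_max, j_max = np.unravel_index(np.nanargmax(values), values.shape)
i_min, j_min = np.unravel_index(np.nanargmin(values), values.shape)

max_pair = (names[i_max], names[j_max], values[i_max, j_max])
min_pair = (names[i_min], names[j_min], values[i_min, j_min])

print("max:", max_pair)
print("min:", min_pair)
```

Because the matrix is symmetric, each pair appears twice; the functions return the first occurrence in row-major order.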

**5. Compute the mean, variance and deviation of the correlations**

The deviation between each pair of assets with respect to the average correlation can be stored in another matrix, whose elements are given by

\begin{equation}
\delta_{ij}=\frac{C_{ij}-\mu_C}{\sigma_C}
\end{equation}

* Implement a function that returns a DataFrame with the deviations, so that it can be plotted nicely with seaborn
* Plot the deviation matrix

**Clue:** *Use the `numpy.mean(x, axis)`, `numpy.var(x, axis)` and `numpy.std(x, axis)` functions*

In [None]:
#CODE
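A possible sketch of the deviation matrix on a toy example. The function name `deviation_matrix` and the matrix `toy_corr` are hypothetical. It also assumes that $\mu_C$ and $\sigma_C$ are taken over the distinct pairs $i<j$ only, which the exercise does not specify.

```python
import numpy as np
import pandas as pd

def deviation_matrix(corr_mat):
    """Return (C_ij - mu_C) / sigma_C as a labelled DataFrame.

    Assumption: mu_C and sigma_C are computed over the distinct pairs i < j,
    so the trivial diagonal (C_ii = 1) does not bias them.
    """
    c = corr_mat.values
    pairs = c[np.triu_indices_from(c, k=1)]  # upper triangle, no diagonal
    mu, sigma = np.mean(pairs), np.std(pairs)
    return pd.DataFrame((c - mu) / sigma,
                        index=corr_mat.index, columns=corr_mat.columns)

names = ["A", "B", "C"]
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.3],
     [-0.2, 0.3, 1.0]],
    index=names, columns=names,
)

delta = deviation_matrix(toy_corr)
print(delta.round(3))
```

The returned DataFrame keeps the asset labels, so it can be passed directly to `sns.heatmap()`. The diagonal entries are not meaningful under this convention and could be masked in the plot.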

# Exercise 2. Distribution of pair correlations

**IMPORTANT:** *The yfinance package returns the downloaded columns in alphabetical order of ticker. If our ticker list is not alphabetical and is paired with some other array (for example, the category of each ticker), the two will no longer line up!*

- So it is good practice to sort the ticker list, and any information paired with it, alphabetically before downloading the data.

In [None]:
import requests
import bs4 as bs

resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

soup = bs.BeautifulSoup(resp.text, 'lxml')

table = soup.find('table', {'class': 'wikitable sortable'})

tickers = []
categories = []

for row in table.findAll('tr')[1:]:

    ticker = row.findAll('td')[0].text
    tickers.append(ticker)

    category = row.findAll('td')[2].text
    categories.append(category)

tickers = [s.replace('\n', '') for s in tickers]

categories = np.array(categories)

sorted_idxs = np.argsort(tickers)

tickers = np.array(tickers)[sorted_idxs].tolist()

categories = categories[sorted_idxs]

start = datetime(2000,1,1)
end = datetime(2023,1,1)

print("This will take a while...")
df = yf.download(tickers, start=start, end=end)["Adj Close"].dropna(axis=1)

bools = np.array([True  if (ticker in df.columns.values) else False for ticker in tickers])

categories = categories[bools]

df_returns = np.log(1+df.pct_change()).dropna()

df_returns.index = df_returns.index.tz_localize(None)

df_returns

**1. Annual distribution of correlations**

- Compute the correlation matrix for each year in the downloaded period
- Plot the distribution of the correlations for each year (it would be nice to make an animation)

What do you observe?

In [None]:
#CODE
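A minimal sketch of the yearly split, run on synthetic returns rather than the downloaded S&P 500 data. The DataFrame `toy_returns`, the helper `pair_correlations` and the dictionary `corr_by_year` are hypothetical names. Only the distinct pairs $i<j$ are kept, so that neither the diagonal nor duplicated pairs enter the distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic daily returns for 5 hypothetical assets over two calendar years
dates = pd.date_range("2020-01-01", "2021-12-31", freq="B")
toy_returns = pd.DataFrame(rng.normal(size=(len(dates), 5)),
                           index=dates, columns=list("ABCDE"))

def pair_correlations(returns):
    """Flattened correlations of the distinct pairs i < j."""
    c = returns.corr().values
    return c[np.triu_indices_from(c, k=1)]

# One array of pair correlations per calendar year
corr_by_year = {
    year: pair_correlations(group)
    for year, group in toy_returns.groupby(toy_returns.index.year)
}

for year, values in corr_by_year.items():
    print(year, values.shape, values.round(2))
```

Each array in `corr_by_year` can then be passed to a histogram function such as `plt.hist` or `sns.histplot`, one year at a time.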

**2. Plot the mean and volatility (standard deviation) of the annual distribution for each year**

In [None]:
#CODE
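A sketch of how the yearly mean and standard deviation could be collected into one table, again on synthetic data. The names `toy_returns` and `summary` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic daily returns for 4 hypothetical assets over three years
dates = pd.date_range("2019-01-01", "2021-12-31", freq="B")
toy_returns = pd.DataFrame(rng.normal(size=(len(dates), 4)),
                           index=dates, columns=list("ABCD"))

rows = []
for year, group in toy_returns.groupby(toy_returns.index.year):
    c = group.corr().values
    pairs = c[np.triu_indices_from(c, k=1)]  # distinct pairs only
    rows.append({"year": year, "mean": pairs.mean(), "std": pairs.std()})

# One row per year, with the mean and volatility of the pair correlations
summary = pd.DataFrame(rows).set_index("year")
print(summary)
```

With matplotlib available, `summary["mean"].plot()` and `summary["std"].plot()` would give the two curves against the year.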

# Exercise 3. Asset graph

A graph (or network) is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related". The objects correspond to mathematical abstractions called vertices (also called nodes or points) and each of the related pairs of vertices is called an edge (also called link or line). Typically, a graph is depicted as a set of circles for the vertices joined by lines or curves for the edges.

The edges may be directed or undirected. For example, if the vertices represent people at a party, and there is an edge between two people if they shake hands, then this graph is undirected because any person A can shake hands with a person B only if B also shakes hands with A. In contrast, if an edge from a person A to a person B means that A owes money to B, then this graph is directed, because owing money is not necessarily reciprocated.

The network can be weighted or unweighted; that is, the links can either carry different values corresponding to the "strength" of the interactions, or take only binary values (0: no link, 1: link) indicating whether a relation exists.

**NetworkX** is the main Python library for working with networks (https://networkx.org/)

In our case, the asset graph is an undirected weighted graph in which the nodes correspond to different assets and the links are the correlation coefficients among them.

**1. Complete the function below to build the asset graph and plot the result**

Here I already provide all the code for the plot, but feel free to explore the NetworkX documentation for other layouts and settings! Later you will have to do the network plots yourselves (just copy the code from here and reuse it).

**Clue:** *Use the `G.add_edge(node_1_idx, node_2_idx, weight=w)` method*

In [None]:
import networkx as nx

def asset_graph(corr_mat, threshold):

  #Set number of nodes
  N = corr_mat.shape[0]

  #Build an empty graph
  G = nx.Graph()

  #Add nodes
  G.add_nodes_from(np.arange(0, N, 1))

  #Add links between each pair of nodes if some criterion is met (in this case, p_ij>threshold)

  #CODE

  return G
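One way the missing loop could look, written as a separate function so the template above stays intact. The name `asset_graph_sketch` and the toy matrix are hypothetical. The criterion used is the one stated in the template comment, a correlation strictly above the threshold.

```python
import numpy as np
import networkx as nx

def asset_graph_sketch(corr_values, threshold):
    """Undirected weighted graph linking pairs with correlation > threshold."""
    n_assets = corr_values.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n_assets))
    # Visit each distinct pair once (i < j); the graph is undirected
    for i in range(n_assets):
        for j in range(i + 1, n_assets):
            if corr_values[i, j] > threshold:
                G.add_edge(i, j, weight=corr_values[i, j])
    return G

toy = np.array([
    [1.0, 0.8, -0.2],
    [0.8, 1.0, 0.3],
    [-0.2, 0.3, 1.0],
])

G_toy = asset_graph_sketch(toy, threshold=0.25)
print(list(G_toy.edges(data="weight")))
```

If no pair exceeds the threshold the graph has no edges, and unpacking `zip(*nx.get_edge_attributes(G, 'weight').items())` into two names raises a `ValueError`, so a lower threshold may be needed for weakly correlated assets.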

In [None]:
threshold = 0.5

#Compute correlation matrix
corr_mat = df_returns.corr()

#Build network with given threshold value
G = asset_graph(corr_mat.values, threshold)

#Obtain the edges and their weights from the network
edges, weights = zip(*nx.get_edge_attributes(G, 'weight').items())

#Names of assets (nodes)
labels = dict(zip(np.arange(0, corr_mat.shape[0], 1), corr_mat.columns.values))

#Add color for each category
color_code = np.copy(categories)

unique_categories = np.unique(categories)

cmap = mpl.cm.get_cmap('tab20', len(unique_categories))

colors = [mpl.colors.rgb2hex(cmap(i)[:3]) for i in range(cmap.N)]

for i in range(len(unique_categories)):

  color_code[categories == unique_categories[i]] = colors[i]

#Plot
plt.figure(figsize=(14, 12))

nx.draw(G, pos = nx.kamada_kawai_layout(G), edgelist=edges, node_color=color_code, edge_color=weights, edge_cmap=plt.cm.Reds, with_labels=True,
        labels=labels, font_size=20)

legend_elements = [Line2D([0], [0], ls="", marker='o', color=colors[i], label=unique_categories[i], markerfacecolor=colors[i], markersize=15)
 for i in range(len(unique_categories))]

plt.legend(handles=legend_elements, loc="upper right", ncol=2, bbox_to_anchor=(1.45, 1))

# Exercise 4. Distances, Minimum Spanning Tree & Hierarchical Tree

The correlations between each pair of assets allow us to define a distance (in the mathematical sense) between them. It can be shown that this distance can be computed as

\begin{equation}
d_{ij}=\sqrt{2(1-C_{ij})}
\end{equation}

Thus, the higher the correlation between a pair of assets, the smaller the distance between them.
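A short derivation of this formula, assuming the distance is the Euclidean distance between the standardized series $\tilde{Y}_i$ and $\tilde{Y}_j$ with the same $1/n$ normalization used for $C_{ij}$ (this choice of normalization is an assumption made to match the definition in Exercise 1). Since each standardized series satisfies $\frac{1}{n}\sum_k\tilde{Y}_i(t_k)^2=1$:

```latex
\begin{equation}
d_{ij}^2=\frac{1}{n}\sum_{k=1}^n\left[\tilde{Y}_i(t_k)-\tilde{Y}_j(t_k)\right]^2
=\frac{1}{n}\sum_{k=1}^n\tilde{Y}_i(t_k)^2+\frac{1}{n}\sum_{k=1}^n\tilde{Y}_j(t_k)^2-\frac{2}{n}\sum_{k=1}^n\tilde{Y}_i(t_k)\tilde{Y}_j(t_k)
=1+1-2C_{ij}=2(1-C_{ij})
\end{equation}
```

Since $-1\leq C_{ij}\leq 1$, the distance ranges from $0$ (perfectly correlated) to $2$ (perfectly anticorrelated).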

The concept of distance between assets, in turn, allows us to compute the **Minimum Spanning Tree (MST)** of the asset graph.

In graph theory, a **tree** is a graph in which every pair of nodes is connected by exactly one path. A **spanning tree** of a given graph is a *tree* that connects all its nodes. Given a *weighted graph*, its **MST** is the *spanning tree* that minimizes the sum of its edge weights.

Thus, to compute the **MST** of the asset graph we first have to compute the distance matrix and use this distance matrix to construct the asset graph (fully connected, no threshold needed now). Finally, we can easily compute the **MST** using NetworkX.

From the distance matrix we can also compute the **Hierarchical tree** associated with the **MST** using SciPy.

**1. Create a function to compute the distance matrix**

In [None]:
#CODE
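A possible sketch of such a function, tested on a toy correlation matrix. The names `distance_matrix_sketch` and `toy_corr` are hypothetical. The clipping step is an added safeguard, not part of the formula.

```python
import numpy as np
import pandas as pd

def distance_matrix_sketch(corr_mat):
    """d_ij = sqrt(2 * (1 - C_ij)), returned with the same labels."""
    # Clip guards against tiny negative values from rounding when C_ij ~ 1
    d = np.sqrt(np.clip(2.0 * (1.0 - corr_mat.values), 0.0, None))
    return pd.DataFrame(d, index=corr_mat.index, columns=corr_mat.columns)

names = ["A", "B", "C"]
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.3],
     [-0.2, 0.3, 1.0]],
    index=names, columns=names,
)

dist = distance_matrix_sketch(toy_corr)
print(dist.round(3))
```

The most correlated pair (A, B) ends up with the smallest distance, and the diagonal is zero because every asset is at distance zero from itself.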

**2. Create a function to create a fully connected and weighted network from the distance matrix (nodes are connected to each other with a weight corresponding to the distance between them).**

**Note:** *Modify the `asset_graph` function implemented before*

In [None]:
#CODE

**3. Compute the Minimum Spanning Tree**

To do so, you first have to compute the distance matrix and then generate the fully connected and weighted network. From this network you can compute the MST with the `networkx.minimum_spanning_tree(G)` function.

In [None]:
#CODE
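A minimal sketch of the whole pipeline on a made-up 4x4 distance matrix. The values in `toy_dist` and the names `G_full` and `mst` are illustrative assumptions, and the fully connected graph is built inline instead of through a modified `asset_graph`.

```python
import numpy as np
import networkx as nx

# Hypothetical symmetric distance matrix with zero diagonal
toy_dist = np.array([
    [0.0, 0.6, 1.5, 1.2],
    [0.6, 0.0, 1.1, 1.4],
    [1.5, 1.1, 0.0, 0.7],
    [1.2, 1.4, 0.7, 0.0],
])

# Fully connected weighted graph: every pair i < j gets an edge
n_assets = toy_dist.shape[0]
G_full = nx.Graph()
G_full.add_nodes_from(range(n_assets))
for i in range(n_assets):
    for j in range(i + 1, n_assets):
        G_full.add_edge(i, j, weight=toy_dist[i, j])

# Spanning tree with the smallest total edge weight
mst = nx.minimum_spanning_tree(G_full, weight="weight")
print(sorted(mst.edges(data="weight")))
```

A spanning tree of $N$ nodes always has $N-1$ edges, so here the 6 edges of the full graph are reduced to 3.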

**4. Compute the associated hierarchical tree**

- Convert the square-form distance matrix to a condensed (vector-form) distance matrix using the `scipy.spatial.distance.squareform(x)` function

- Create the hierarchical tree using the `scipy.cluster.hierarchy.linkage(y, method, metric)` function with the `ward` method and `euclidean` distance

- Plot the tree using the `scipy.cluster.hierarchy.dendrogram(linkage_data)` function

In [None]:
#CODE
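A sketch of the SciPy calls on a made-up 4x4 distance matrix. The values in `toy_dist` and the labels are hypothetical. `dendrogram` is called with `no_plot=True` here so that the snippet runs without a display; removing that argument draws the tree.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical symmetric distance matrix with zero diagonal
toy_dist = np.array([
    [0.0, 0.6, 1.5, 1.2],
    [0.6, 0.0, 1.1, 1.4],
    [1.5, 1.1, 0.0, 0.7],
    [1.2, 1.4, 0.7, 0.0],
])
labels = ["A", "B", "C", "D"]

# Square form -> condensed vector holding the N*(N-1)/2 pair distances
condensed = squareform(toy_dist)

# Ward linkage; with a condensed input the `metric` argument is not used
Z = linkage(condensed, method="ward")

# no_plot=True returns the tree layout without drawing it
tree = dendrogram(Z, labels=labels, no_plot=True)
print(tree["ivl"])  # leaf labels in the order they appear in the tree
```

Passing the square matrix directly to `linkage` would make SciPy treat each row as an observation vector, which is why the conversion to condensed form comes first.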