In [1]:
import altair as alt
import pandas as pd

# https://www.kaggle.com/open-source-sports/baseball-databank/version/2?select=Batting.csv

source = pd.read_csv("batting.csv")

alt.data_transformers.disable_max_rows()

alt.Chart(source.sample(n=1000)).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='lgID:N'  # assumes the Databank's league column; batting.csv has no 'Origin' field
).properties(
    width=150,
    height=150
).repeat(
    row=['HR', 'RBI', 'SO'],
    column=['SO', 'RBI', 'HR']
).interactive()

### Understanding covariance matrices
Often we have more than two variables or features. Instead, we have vectors of observations in $K$ dimensions. Let $X_i$ be a vector of observations consisting of $x_1, x_2, \ldots, x_K$, and let $X_j$ be another such vector. If we have $N$ such vectors, then all of the 3rd components of the vectors form one sequence of $N$ observations, and all of the 4th components form another. We can then measure the covariance between these two sequences. If we compute the covariance for all pairs of sequences, we get a $K \times K$ covariance matrix. We will discuss this on the whiteboard.
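
As a minimal sketch of the idea, the cell below builds a covariance matrix straight from the definition and compares it with `np.cov`. The array `X` is hypothetical toy data (5 vectors with 3 components each), not part of the batting dataset.

```python
import numpy as np

# Hypothetical toy data: N = 5 observation vectors, each with K = 3 components.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 2.5],
    [4.0, 3.0, 3.5],
    [5.0, 5.0, 4.5],
])
N, K = X.shape

# Column k is the sequence made of the k-th component of every vector.
means = X.mean(axis=0)

cov = np.zeros((K, K))
for a in range(K):
    for b in range(K):
        # Population covariance: mean product of the deviations from each mean.
        cov[a, b] = np.mean((X[:, a] - means[a]) * (X[:, b] - means[b]))

# np.cov treats rows as variables by default, so rowvar=False is needed here.
assert np.allclose(cov, np.cov(X, rowvar=False, bias=True))
print(cov)
```

The double loop is written for readability; `np.cov` computes the same matrix in one call.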

In [20]:
source[["HR", "SO", "RBI"]]

Unnamed: 0,HR,SO,RBI
0,0.0,0.0,0.0
1,0.0,0.0,13.0
2,0.0,5.0,19.0
3,2.0,2.0,27.0
4,0.0,1.0,16.0
...,...,...,...
101327,0.0,0.0,0.0
101328,6.0,26.0,33.0
101329,7.0,30.0,23.0
101330,11.0,132.0,28.0


In [69]:
import numpy as np

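# See the docs: https://numpy.org/doc/stable/reference/generated/numpy.cov.html
# np.cov expects one variable per row, hence the transpose:
# https://en.wikipedia.org/wiki/Transpose
_small = source[["HR", "SO", "RBI"]].sample(n=100).fillna(value=0)

small_reshaped = _small.T  # shape (3, 100): rows are HR, SO, RBI

# bias=True divides by N (population covariance), matching np.var's default
np.cov(small_reshaped, bias=True)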

array([[ 45.2744, 161.9616, 168.2028],
       [161.9616, 922.9324, 660.3592],
       [168.2028, 660.3592, 791.7436]])

In [70]:
np.var(_small["SO"])  # matches the SO entry (2nd row, 2nd column) on the diagonal of the covariance matrix

922.9324000000004
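
The same check can be run for every variable at once. The sketch below uses made-up Poisson counts as a stand-in for the batting sample; the `toy` frame and its values are assumptions for illustration, not real statistics.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the batting sample; these numbers are NOT real data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "HR": rng.poisson(5, size=100),
    "SO": rng.poisson(40, size=100),
    "RBI": rng.poisson(30, size=100),
}).astype(float)

# Transpose so that each row is one variable, as np.cov expects by default.
cov = np.cov(toy.T, bias=True)

# With bias=True, np.cov divides by N, the same normalization as ndarray.var (ddof=0).
variances = toy.to_numpy().var(axis=0)

# Each diagonal entry of the covariance matrix is that variable's variance.
assert np.allclose(np.diag(cov), variances)
```

If `bias=True` were dropped, `np.cov` would divide by N - 1 instead, and the diagonal would no longer match `np.var` with its default settings.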

1. Which components are positively correlated?
2. Which variables have the largest variance? Does that match what you observe on the plots?
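
One way to check an answer to the first question is to rescale the covariance matrix into a correlation matrix by dividing each entry by the product of the two standard deviations. The sketch below reuses the values printed in the `np.cov` output earlier; because that sample is random, a fresh run of the notebook will give different numbers.

```python
import numpy as np

# Covariance matrix copied from the np.cov output above (order: HR, SO, RBI).
cov = np.array([
    [45.2744, 161.9616, 168.2028],
    [161.9616, 922.9324, 660.3592],
    [168.2028, 660.3592, 791.7436],
])

# Standard deviations are the square roots of the diagonal (the variances).
std = np.sqrt(np.diag(cov))

# corr[a, b] = cov[a, b] / (std[a] * std[b]); the sign is the same as the covariance's.
corr = cov / np.outer(std, std)
print(np.round(corr, 2))
```

Correlations lie between -1 and 1, which makes pairs easier to compare than raw covariances that depend on each variable's scale.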

### If time

Try it on your own data!