Let's find and load our data!

In [None]:
# New data! We will use this dataset from Kaggle: https://www.kaggle.com/akiboy96/spotify-dataset
import numpy as np

columns=['danceability','energy','key','loudness','mode','speechiness','acousticness','instrumentalness','liveness','valence','tempo','duration_ms','time_signature','chorus_hit','sections','popularity']
data = np.genfromtxt('data/spotify_dataset.csv', delimiter=',', skip_header=1, usecols=(2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17), dtype=float, encoding='utf-8')

Let's look at our data!

In [None]:
import pandas as pd

def getSummaryStatistics(data):
    return np.array([data.max(axis=0), data.min(axis=0), data.mean(axis=0), data.var(axis=0)])

def getShapeType(data):
    return (data.shape, data.dtype)

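# I want to see it better! Label the rows with the statistic and the columns with the feature name.
print(pd.DataFrame(getSummaryStatistics(data), index=['max', 'min', 'mean', 'var'], columns=columns))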
getShapeType(data)

In [None]:
# let's look at the correlation of each variable with danceability (column 0)

def getCorrelationsPairwise(data):
    for i in range(len(columns)):
        print(columns[i], np.corrcoef(data[:, 0], data[:, i], rowvar=True)[0,1])
        
getCorrelationsPairwise(data)

In [None]:
import plotly.express as px
import pandas as pd

# which columns do we want to keep for visualization?
columnsToKeep = [0,]
dataSubset = data[np.ix_(np.arange(data.shape[0]), columnsToKeep)]
df = pd.DataFrame(dataSubset, columns=[columns[i] for i in columnsToKeep])

# pair plot
fig = px.scatter_matrix(
    df,
    dimensions=[columns[i] for i in columnsToKeep],
    color="danceability"
)
fig.update_traces(diagonal_visible=False)
fig.show()

# hmm, let's try a scatter plot
# what do we want for x, color and size?
fig = px.scatter(df, x="", y="danceability", color="",
                 size="")
fig.show()

# I mean, there's something there, but it's not at all clear, right??

# Covariance

As we can see from the correlation analysis above, features (variables) in a data set may be related to each other.

The *covariance matrix* of a data set tells us about the linear (first-order) relationships between pairs of features. If $A$ is the matrix corresponding to our data set of $N$ data points, each with $M$ features, and $\bar{A}_i$ is the mean of the $i$th feature, then the $M \times M$ covariance matrix $C$ is:
    $$C_{i,j} = \frac{1}{N-1} \sum_{k=1}^N (A_{k,i} - \bar{A}_i)(A_{k,j} - \bar{A}_j)$$
    
The covariance matrix has the variance of each feature along its diagonal, and the remaining entries are the *covariances* of pairs of features, i.e., how much they vary together. If two features vary together, then they are related to each other: some information is shared between them.
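
To make the definition concrete, here is a minimal sketch that applies the formula entry by entry. The array `A` below is a made-up toy data set (not the Spotify data), and the loop is written for readability rather than speed. Note that the diagonal matches the variance only when the variance also uses the $N-1$ denominator (`ddof=1`); NumPy's `var` defaults to dividing by $N$.

```python
import numpy as np

# Hypothetical toy data: N = 5 data points (rows), M = 2 features (columns).
A = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 5.9],
              [4.0, 8.2],
              [5.0, 9.8]])
N, M = A.shape
means = A.mean(axis=0)  # one mean per feature

# Apply the definition of C[i, j] directly, one entry at a time.
C = np.zeros((M, M))
for i in range(M):
    for j in range(M):
        C[i, j] = np.sum((A[:, i] - means[i]) * (A[:, j] - means[j])) / (N - 1)

print(C)

# The diagonal holds the sample variance (N - 1 denominator) of each feature.
print(np.allclose(np.diag(C), A.var(axis=0, ddof=1)))
```

The matrix is symmetric, since swapping $i$ and $j$ in the formula does not change the product inside the sum.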

Questions:
* If the covariance is close to 0, then what is true of the pair of features?
* If the covariance is big, then what is true of the pair of features?
* Correlations can be positive or negative; what about covariances?

We can calculate covariance using matrix multiplication:
* First, center the data by subtracting each feature's mean from its column: $A_c = A - \bar{A}$
* Then, calculate $C$: $C = \frac{1}{N-1} A_c^TA_c$
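
The two steps above can be sketched in NumPy as follows. This is one possible implementation, not the only one: the helper name `covariance_matrix` and the random test matrix are assumptions made for illustration, and the result is checked against `np.cov`, which also divides by $N-1$ by default.

```python
import numpy as np

def covariance_matrix(A):
    """Sample covariance of the columns of A (rows are data points)."""
    # Step 1: center the data. Broadcasting subtracts each column's mean.
    A_c = A - A.mean(axis=0)
    # Step 2: C = A_c^T A_c / (N - 1)
    return (A_c.T @ A_c) / (A.shape[0] - 1)

# Hypothetical test data: 50 data points, 3 features.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
C = covariance_matrix(A)

# rowvar=False tells np.cov that the features are in the columns.
print(np.allclose(C, np.cov(A, rowvar=False)))
```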

Let's look at some toy examples (h/t Stephanie Taylor!)

These two variables co-vary.

In [None]:
A = np.array([[1,2,3,0], [1.1, 2.1, 3, 0.5]]).T
fig = px.scatter(pd.DataFrame(A, columns=['X', 'Y']), x="X", y="Y")
fig.show()

# First, center the data

# Then, calculate C

# What do we observe about the variance of X? of Y? What about the covariance?

This is just like before, but the values of $Y$ are now negative.

In [None]:
A = np.array([[1,2,3,0], [-1.1, -2.1, -3, -0.5]]).T
fig = px.scatter(pd.DataFrame(A, columns=['X', 'Y']), x="X", y="Y")
fig.show()

# First, center the data

# Then, calculate C

# What do we observe about the variance of X? of Y? What about the covariance?

The values of $X$ and $Y$ are now random.

In [None]:
A = np.array([np.random.standard_normal(4), np.random.standard_normal(4)]).T
fig = px.scatter(pd.DataFrame(A, columns=['X', 'Y']), x="X", y="Y")
fig.show()

# First, center the data

# Then, calculate C

# What do we observe about the variance of X? of Y? What about the covariance?

Okay, now let's try this on our data!

In [None]:
# Let's start with just a subset of our data
A = dataSubset

# First, center the data

# Then, calculate C

# I want to see it better!
print(pd.DataFrame(C))

# What do we observe about the variance of each feature? What about the covariances between features?

