### Presentation requirements

- Send me a plan as early as possible.
- Send me a draft notebook by Sunday evening. 
- Send me the final corrected notebook Tuesday. 
- Indicate sources and your work in the notebook.
- Include examples (toy and MSS) in the notebook.
- Your work should be focused on the MSS dataset. 

### Presentation grades


Your grade is based on your knowledge of the techniques, code, and MSS data you present, and on your ability to respond to questions.

### Target variables

- Song and artist hotness (`artist_hotttnesss`, `song_hotttnesss`)

In [1]:
import pandas as pd
save_load_path = '/Users/David/Desktop'
mss_df = pd.read_pickle(save_load_path+'/mss_df.pkl')
mss_df.dtypes

artist_familiarity    float64
artist_hotttnesss     float64
artist_id              object
artist_latitude       float64
artist_location        object
artist_longitude      float64
bc_0                  float64
bc_1                  float64
bc_2                  float64
bc_3                  float64
bc_4                  float64
bc_5                  float64
bc_6                  float64
bc_7                  float64
bc_8                  float64
bc_9                  float64
duration              float64
loudness              float64
mode                  float64
release                object
song_hotttnesss       float64
song_id                object
sp_0                  float64
sp_1                  float64
sp_10                 float64
sp_11                 float64
sp_12                 float64
sp_13                 float64
sp_14                 float64
sp_15                 float64
                       ...   
st_33                 float64
st_34                 float64
st_35     

### Principal Component Analysis

- Performed on the correlation or covariance matrix.
- Uses eigenvalues and eigenvectors.

### Source: Python Data Science Cookbook by Gopi Subramanian (Packt Publishing)

PCA is done using the following steps:

1. Standardize the dataset to have a zero mean value and unit standard deviation.
1. Find the correlation matrix for the dataset.
1. Decompose the correlation matrix into its eigenvectors and eigenvalues.
1. Select the top `n` eigenvectors based on the eigenvalues sorted in descending order.
1. Project the standardized input matrix onto the new subspace spanned by the selected eigenvectors.
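
The steps above do not say how to pick `n`. One common heuristic, sketched here on the `iris` toy data, is to keep enough components to reach a chosen share of the total variance. The 0.95 threshold and the names `explained`, `cumulative` and `n` are illustrative assumptions, not part of the cookbook recipe.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

# Standardize, then build the 4 x 4 correlation matrix
x_s = scale(load_iris()['data'])
x_c = np.corrcoef(x_s.T)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order;
# reverse them so the largest comes first
eig_val = np.linalg.eigvalsh(x_c)[::-1]

# Each eigenvalue's share of the total variance, and the running total
explained = eig_val / eig_val.sum()
cumulative = np.cumsum(explained)

# Smallest n whose cumulative share reaches the (arbitrary) 0.95 threshold
n = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained)
print(n)
```

For a correlation matrix the eigenvalues sum to the number of features, so each share is simply the eigenvalue divided by 4 here.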

In [3]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale
import scipy.linalg  # import the submodule explicitly so scipy.linalg.eig is available
import matplotlib.pyplot as plt

### Load the `iris` data

In [13]:
data = load_iris()
x = data['data']
y = data['target']
print('y:')
print(y)
print('x:')
print(x)

y:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
x:
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2

### PCA is an unsupervised method (the target variable is not used)

### Scale the data such that mean = 0 and standard deviation = 1



In [14]:
x_s = scale(x, with_mean=True, with_std=True, axis=0)
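
A quick sanity check on the scaling step, as a minimal sketch: after `scale`, every column should have mean 0 and standard deviation 1. The names `col_means` and `col_stds` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

x = load_iris()['data']
x_s = scale(x, with_mean=True, with_std=True, axis=0)

# Column-wise statistics of the scaled data
col_means = x_s.mean(axis=0)
col_stds = x_s.std(axis=0)  # NumPy's default (ddof=0) is the population standard deviation

print(np.allclose(col_means, 0.0))
print(np.allclose(col_stds, 1.0))
```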

### Calculate the correlation matrix

In [15]:
x_c = np.corrcoef(x_s.T)
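
To see what the correlation matrix contains, the sketch below computes it three ways on the `iris` data: from the raw data, from the scaled data, and by hand as `x_s.T.dot(x_s) / n_samples`. The variable names are illustrative assumptions. The by-hand version relies on `scale` dividing by the population standard deviation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

x = load_iris()['data']
x_s = scale(x)
n_samples = x_s.shape[0]

# Correlation is invariant to shifting and rescaling each column
corr_from_raw = np.corrcoef(x.T)
corr_from_scaled = np.corrcoef(x_s.T)

# For standardized columns the correlation matrix is the averaged outer product
corr_by_hand = x_s.T.dot(x_s) / n_samples

print(np.allclose(corr_from_raw, corr_from_scaled))
print(np.allclose(corr_from_scaled, corr_by_hand))
```

Scaling therefore does not change the correlation matrix itself; it matters later, when the data are projected onto the eigenvectors.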

### Find the eigenvalues and eigenvectors of the correlation matrix

In [16]:
eig_val, r_eig_vec = scipy.linalg.eig(x_c)
print('Eigen values \n%s' % (eig_val))
print('\n Eigen vectors \n%s' % (r_eig_vec))

Eigen values 
[ 2.91081808+0.j  0.92122093+0.j  0.14735328+0.j  0.02060771+0.j]

 Eigen vectors 
[[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
 [-0.26335492 -0.92555649  0.24203288 -0.12413481]
 [ 0.58125401 -0.02109478  0.14089226 -0.80115427]
 [ 0.56561105 -0.06541577  0.6338014   0.52354627]]
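
The eigenvalues above are printed as complex numbers with a zero imaginary part because `scipy.linalg.eig` is a general-purpose solver, and it makes no promise about their order. Since a correlation matrix is symmetric, a possible alternative (not the cookbook's approach) is `scipy.linalg.eigh`, which returns real eigenvalues in ascending order. The sketch below sorts them in descending order and compares the two solvers. Names such as `val_sym` and `vec_general` are illustrative.

```python
import numpy as np
import scipy.linalg
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

x_c = np.corrcoef(scale(load_iris()['data']).T)

# General solver, as used in the cell above
val_general, vec_general = scipy.linalg.eig(x_c)
order_general = np.argsort(val_general.real)[::-1]

# Symmetric solver: real eigenvalues, ascending, so reorder to descending
val_sym, vec_sym = scipy.linalg.eigh(x_c)
order_sym = np.argsort(val_sym)[::-1]
val_sym, vec_sym = val_sym[order_sym], vec_sym[:, order_sym]

# Same eigenvalues; eigenvectors agree up to the sign of each column
same_values = np.allclose(val_general.real[order_general], val_sym)
same_vectors = np.allclose(np.abs(vec_general[:, order_general]),
                           np.abs(vec_sym), atol=1e-6)
print(same_values, same_vectors)
```

Sorting explicitly avoids relying on the solver happening to return the largest eigenvalue first.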


### Select the first two eigenvectors

In [18]:
w = r_eig_vec[:,0:2]
w

array([[ 0.52237162, -0.37231836],
       [-0.26335492, -0.92555649],
       [ 0.58125401, -0.02109478],
       [ 0.56561105, -0.06541577]])

### Project the dataset from 4 dimensions to 2 using the selected eigenvectors

In [None]:
x_rd = x_s.dot(w)
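
As a cross-check, the manual projection can be compared with scikit-learn's `PCA` class. This is a sketch under the assumption that both are fed the same standardized data. The comparison uses absolute values because the sign of each component is arbitrary. `x_pca` is a name introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

x_s = scale(load_iris()['data'])

# Manual route: top two eigenvectors of the correlation matrix
eig_val, eig_vec = np.linalg.eigh(np.corrcoef(x_s.T))
top_two = np.argsort(eig_val)[::-1][:2]
w = eig_vec[:, top_two]
x_rd = x_s.dot(w)

# Library route on the same standardized data
x_pca = PCA(n_components=2).fit_transform(x_s)

# Components match up to a sign flip per column
print(x_rd.shape)
print(np.allclose(np.abs(x_rd), np.abs(x_pca), atol=1e-6))
```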

### Scatter plot the new two dimensions

In [19]:
plt.figure(1)
plt.scatter(x_rd[:,0],x_rd[:,1],c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")

<matplotlib.text.Text at 0x10c0636a0>

### Techniques

- Regressions
- K-nearest neighbors
- Cluster analysis
- Association rules

"The core idea in this chapter is to augment/replace the vector of inputs X with additional variables, which are transformations of X, and then use linear models in this new space of derived input features." --- EoSL

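A toy illustration of the quoted idea, as a minimal sketch: a linear model cannot fit $y = x^2$ on a symmetric range, but the same linear model fits it exactly once the input is augmented with the derived feature $x^2$. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is a quadratic function of a single input
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (x ** 2).ravel()

# Linear model on the original input
linear_fit = LinearRegression().fit(x, y)
r2_linear = linear_fit.score(x, y)

# Same linear model on the augmented input [x, x^2]
x_aug = np.hstack([x, x ** 2])
augmented_fit = LinearRegression().fit(x_aug, y)
r2_augmented = augmented_fit.score(x_aug, y)

print(r2_linear, r2_augmented)
```

The model stays linear in its coefficients; only the feature space has changed.
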
A kernel function $K$ corresponds to a mapping into a higher-dimensional space followed by an inner product in that space.

Polynomial kernel:

- $K(x_1, x_2) = (x_1 \cdot x_2 + c)^d$ with $c>0$
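
For $d = 2$ and two-dimensional inputs, the polynomial kernel can be checked against an explicit feature map. The sketch below assumes $c = 1$. The function names `poly_kernel` and `phi` are hypothetical helpers written for this check.

```python
import numpy as np

def poly_kernel(x1, x2, c=1.0, d=2):
    """Polynomial kernel: (x1 . x2 + c)^d."""
    return (np.dot(x1, x2) + c) ** d

def phi(x, c=1.0):
    """Explicit degree-2 feature map for a 2-D input (a, b)."""
    a, b = x
    return np.array([a * a, b * b, np.sqrt(2) * a * b,
                     np.sqrt(2 * c) * a, np.sqrt(2 * c) * b, c])

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

k_direct = poly_kernel(x1, x2)            # computed in the original 2-D space
k_mapped = np.dot(phi(x1), phi(x2))       # inner product in the 6-D feature space

print(k_direct, k_mapped)
```

The kernel gives the same value as the inner product in the six-dimensional space without ever building the mapped vectors.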