<h1> Dimensionality Reduction: Introduction </h1>


<h2> Project Proposal </h2>


The objective of this project is to reduce the dimensionality of the tabular data, condensing its many disparate columns into a smaller set of vectors that combine the meaningful signal in the data sources I have aggregated. This can speed up downstream analyses such as decision trees or regression models, and it may also improve their performance by excluding noise from the data where possible. In terms of tools, I will use Python's scikit-learn, particularly PCA (in sklearn.decomposition) and t-SNE (in sklearn.manifold), along with its sub-package sklearn.metrics, to analyze the data and attempt dimensionality reduction.
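
As a rough illustration of the two tools named above, the following sketch runs PCA and t-SNE on a small synthetic table. The array `X_demo`, its shape, the 90% variance threshold, and the perplexity value are all assumptions made for the example; they are not the project's data or final settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the record data: 200 rows, 12 numeric columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 12))
# Make one column nearly redundant so there is something to compress.
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Standardize so columns measured on different scales contribute comparably.
X_scaled = StandardScaler().fit_transform(X_demo)

# PCA: keep as many components as needed to explain 90% of the variance
# (the 0.90 threshold is an assumed value for illustration).
pca_demo = PCA(n_components=0.90, svd_solver="full")
X_pca = pca_demo.fit_transform(X_scaled)
print("PCA output shape:", X_pca.shape)
print("Cumulative explained variance:", pca_demo.explained_variance_ratio_.cumsum())

# t-SNE: nonlinear embedding into 2 dimensions, mainly useful for plotting.
tsne_demo = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne_demo.fit_transform(X_scaled)
print("t-SNE output shape:", X_tsne.shape)
```

PCA yields a reusable linear projection that can feed later models, whereas t-SNE only embeds the rows it was fitted on, so it is better suited to visualization than to preprocessing.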

The record dataset I will be analyzing for dimensionality reduction includes: 1 - The daily level of retail activity for the top 10 most active stock tickers, 2 - The daily change in the level of retail activity for the top 10 most active stock tickers, 3 - Weekly individual investor survey data (columns for the percent of respondents that were bearish, bullish, or neutral), 4 - Weekly change in the prices of major stock indices, 5 - Weekly StockTwits rankings of the most active stock tickers.

The dataset described above consists of all the record data that I gathered for this project. Data pertaining to retail investor sentiment and trading is exceptionally scarce online, so it makes sense to use everything available for dimensionality reduction. Text data was not used because it differs fundamentally from the record data and because its 10,000 columns would be computationally demanding.


<h1> Code Implementation: </h1>


In [None]:
import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_samples, silhouette_score
import pandas as pd


# Read data in:


In [None]:



#--------------------------------------
# USER PARAMETERS
#--------------------------------------

NDIM    = 3           # DIMENSION OF DATA
L       = 50          # BOX SIZE (DATA BOUNDS)
NPOINTS = 10_000_000  # NUMBER OF SAMPLES TO DRAW


# MEAN VECTOR (ALL ZEROS, SO THE DATA IS CENTERED AT THE ORIGIN)
u = np.zeros(NDIM)

# COVARIANCE MATRIX
# cov(xi,xj) --> symmetric
# RANDOM DIAGONAL ENTRIES (low MUST NOT EXCEED high)
s = np.random.uniform(L/10, 0.2*L, size=NDIM)
cov = np.random.uniform(-L/10, L/10, size=(NDIM, NDIM))
# FILL MAIN DIAG OF THE FACTOR MATRIX WITH s
np.fill_diagonal(cov, s, wrap=False)
# FORCE MATRIX TO BE SYMMETRIC POSITIVE SEMI-DEFINITE (A @ A.T)
# NOTE: AFTER THIS STEP THE DIAGONAL NO LONGER EQUALS s
cov = np.dot(cov, cov.transpose())
print('EXACT MEAN:',u)
print("EXACT COV:")
print(cov)

# SAMPLE
X = np.random.multivariate_normal(u, cov, NPOINTS)
print('\nNUMERIC MEAN:',np.mean(X,axis=0))
print("X SHAPE",X.shape)
print("NUMERIC COV:")
print(np.cov(X.T))

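# EIGENVALUES/EIGENVECTORS OF THE SAMPLE COVARIANCE
from numpy import linalg as LA
# w, v1 = LA.eigh(cov)
# eigh IS FOR SYMMETRIC MATRICES: REAL OUTPUT, EIGENVALUES IN ASCENDING ORDER
w, v1 = LA.eigh(np.cov(X.T))
# REORDER TO DESCENDING EIGENVALUE SO ROWS LINE UP WITH pca.components_
order = np.argsort(w)[::-1]
w, v1 = w[order], v1[:, order]
print("\nCOV EIGENVALUES:",w)
print("COV EIGENVECTORS (across rows):")
print(v1.T)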

# X = np.random.multivariate_normal(u, cov, NPOINTS)

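# PCA CALCULATION
from sklearn.decomposition import PCA
pca = PCA(n_components=NDIM)
pca.fit(X)
# EACH ROW IS ONE PRINCIPAL AXIS, ORDERED BY DECREASING EXPLAINED VARIANCE
print('\nPCA COMPONENTS (across rows):')
print(pca.components_)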
# v2=pca.components_

# print(v1/v2)

# # PLOT
# fig = plt.figure()
# ax = fig.add_subplot(projection='3d')
# ax.scatter(X[:,0],X[:,1],X[:,2],marker=".", cmap="viridis")
# v1=v1*1000
# # v2=v2*1000

# ax.quiver(0,0,0,v1[0,0],v1[1,0],v1[2,0])
# ax.quiver(0,0,0,v1[0,1],v1[1,1],v1[2,1])
# ax.quiver(0,0,0,v1[0,2],v1[1,2],v1[2,2])

# # ax.quiver(0,0,0,v2[0,0],v2[1,0],v2[2,0])
# # ax.quiver(0,0,0,v2[0,1],v2[1,1],v2[2,1])
# # ax.quiver(0,0,0,v2[0,2],v2[1,2],v2[2,2])
# plt.show()
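
The commented-out `v1/v2` comparison above hints at checking that the covariance eigenvectors and the PCA components agree. A minimal sketch of that check follows, assuming a small fixed covariance (the matrix `A` and the sample size are made up for the example) so that it runs quickly. Eigenvectors are only defined up to sign, so the comparison is made on absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical factor matrix; A @ A.T is symmetric positive semi-definite.
A = np.array([[3.0, 0.5, 0.2],
              [0.5, 2.0, 0.3],
              [0.2, 0.3, 1.0]])
cov_demo = A @ A.T
X_small = rng.multivariate_normal(np.zeros(3), cov_demo, size=5000)

# Eigendecomposition of the sample covariance (eigh returns ascending order).
evals, evecs = np.linalg.eigh(np.cov(X_small.T))
order = np.argsort(evals)[::-1]          # reorder to descending, like PCA
evals, evecs = evals[order], evecs[:, order]

# PCA on the same samples; rows of components_ are the principal axes.
pca_small = PCA(n_components=3).fit(X_small)

# Compare directions up to sign, and eigenvalues against explained variances.
same_directions = np.allclose(np.abs(evecs.T), np.abs(pca_small.components_), atol=1e-6)
same_variances = np.allclose(evals, pca_small.explained_variance_)
print("Directions match up to sign:", same_directions)
print("Eigenvalues match explained variance:", same_variances)
```

An elementwise ratio such as `v1/v2` would show values of +1 or -1 only if the eigenvectors are arranged as rows and sorted in the same order as the PCA components, which is why the sketch sorts and transposes before comparing.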


<h1> Project Report </h1>

