# Exploratory Data Analysis

The goal of today's discussion is to show how to explore high-dimensional datasets. We will look at two popular methods:

* Principal Component Analysis 
* Latent Semantic Analysis


## Principal Component Analysis

## Text Analysis

In this part of the workshop, we'll rank a small collection of documents against a keyword query, first with TF-IDF weights and then with **latent semantic analysis** (LSA). First, let's import NumPy, the scikit-learn text vectorizers, and SciPy's cosine distance function.

In [1]:
import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.spatial.distance import cosine

Our corpus is a list of five short sentences: three about Romeo and Juliet and two about New Hampshire.

In [2]:
corpus = [
"Romeo and Juliet.",
"Juliet: O happy dagger!",
"Romeo died by dagger.",
"'Live free or die', that's the New-Hampshire's motto.",
"Did you know, New-Hampshire is in New-England."
]
print corpus


['Romeo and Juliet.', 'Juliet: O happy dagger!', 'Romeo died by dagger.', "'Live free or die', that's the New-Hampshire's motto.", 'Did you know, New-Hampshire is in New-England.']


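Next, we preprocess the corpus by hand: punctuation is removed, "died" is reduced to "die", and hyphenated names are joined into single tokens. We also define the query keywords and a list of stop words.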

In [3]:
preprocessed_corpus = [
"Romeo and Juliet",
"Juliet O happy dagger",
"Romeo die by dagger",
"Live free or die that the NewHampshire motto",
"Did you know NewHampshire is in NewEngland"
]

key_words = ['die', 'dagger']

stop_words = ["the", "and", "in", "by", "or", "did", "you", "is", "that"]

print stop_words



['the', 'and', 'in', 'by', 'or', 'did', 'you', 'is', 'that']
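
Stop words carry little topical signal, so they are often removed before counting terms. As a minimal sketch of what that filtering does to one preprocessed document (the helper name `remove_stop_words` is hypothetical and not part of the notebook or of scikit-learn):

```python
# Stop-word list copied from the cell above.
stop_words = ["the", "and", "in", "by", "or", "did", "you", "is", "that"]

def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop tokens found in stop_words.

    Hypothetical helper for illustration only.
    """
    blocked = set(stop_words)
    return [token for token in text.lower().split() if token not in blocked]

filtered = remove_stop_words("Live free or die that the NewHampshire motto", stop_words)
# filtered == ['live', 'free', 'die', 'newhampshire', 'motto']
```

Applied to the whole corpus, this would shrink the vocabulary to the content-bearing terms only.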


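Now we build the term-frequency (count) matrix with `CountVectorizer`. Stop-word removal is left disabled here (`stop_words=None`), so every token stays in the vocabulary. The vectorizer lowercases the text and, with its default token pattern, drops single-character tokens such as "O". The query is appended to the corpus as an extra document so that it is encoded with the same vocabulary.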

In [4]:
# vectorizer = CountVectorizer(min_df=0, stop_words=stop_words, strip_accents='ascii')
vectorizer = CountVectorizer(min_df=0, stop_words=None, strip_accents='ascii')

docs_tf = vectorizer.fit_transform(preprocessed_corpus)
docs_query_tf = vectorizer.transform(preprocessed_corpus + [' '.join(key_words)])

# analyze = vectorizer.build_analyzer()
vocabulary_terms = vectorizer.get_feature_names()

print vocabulary_terms


[u'and', u'by', u'dagger', u'did', u'die', u'free', u'happy', u'in', u'is', u'juliet', u'know', u'live', u'motto', u'newengland', u'newhampshire', u'or', u'romeo', u'that', u'the', u'you']
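
To make the counting step less of a black box, here is a minimal pure-Python sketch of bag-of-words counting. It assumes the vectorizer's default behaviour is lowercasing plus tokens of two or more word characters; the names `tokenize` and `count_vector` are hypothetical.

```python
import re
from collections import Counter

preprocessed_corpus = [
    "Romeo and Juliet",
    "Juliet O happy dagger",
    "Romeo die by dagger",
    "Live free or die that the NewHampshire motto",
    "Did you know NewHampshire is in NewEngland",
]

# Assumption: tokens are runs of at least two word characters, after lowercasing.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return its tokens in order of appearance."""
    return TOKEN_PATTERN.findall(text.lower())

# Sorted list of distinct tokens across the corpus.
vocabulary = sorted(set(tok for doc in preprocessed_corpus for tok in tokenize(doc)))

def count_vector(text):
    """Return the term counts of `text`, ordered like `vocabulary`."""
    counts = Counter(tokenize(text))
    return [counts.get(term, 0) for term in vocabulary]

tf_rows = [count_vector(doc) for doc in preprocessed_corpus]  # D x V list of lists
query_row = count_vector("die dagger")
```

Under these assumptions the sketch yields the same 20 terms as the printed vocabulary above, and "O" is absent because it is a single character.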


### TF-IDF
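
TF-IDF reweights raw counts so that terms appearing in many documents count for less. The following is a minimal NumPy sketch of the weighting we believe `TfidfTransformer(smooth_idf=False)` applies: an unsmoothed inverse document frequency of ln(n / df) + 1, followed by L2 normalization of each row. The function name `tfidf_weights` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def tfidf_weights(counts):
    """Turn a D x V matrix of raw term counts into L2-normalized TF-IDF rows.

    Assumes every term (column) occurs in at least one document.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = float(counts.shape[0])
    doc_freq = (counts > 0).sum(axis=0)        # documents containing each term
    idf = np.log(n_docs / doc_freq) + 1.0      # unsmoothed inverse document frequency
    weighted = counts * idf                    # scale each column by its idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                    # leave all-zero rows untouched
    return weighted / norms

# Toy counts: 2 documents, 3 terms; the middle term occurs in both documents.
toy_counts = [[1, 1, 0],
              [0, 1, 1]]
toy_tfidf = tfidf_weights(toy_counts)
```

In the toy result, the term shared by both documents gets a smaller weight than the term unique to each document.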


In [5]:
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(docs_query_tf.toarray())
tfidf_matrix = tfidf.toarray()[:-1] # D x V matrix
query_tfidf = tfidf.toarray()[-1]

print 
print 'Query:', ' '.join(key_words)
print 
print '--------------------- Rank list ----------------------'
query_doc_tfidf_cos_dist = [cosine(query_tfidf, doc_tfidf) for doc_tfidf in tfidf_matrix]
query_doc_tfidf_sort_index = np.argsort(np.array(query_doc_tfidf_cos_dist))

for rank, sort_index in enumerate(query_doc_tfidf_sort_index):
    print rank, query_doc_tfidf_cos_dist[sort_index], corpus[sort_index]
    


Query: die dagger

--------------------- Rank list ----------------------
0 0.434542044158 Romeo died by dagger.
1 0.69154101474 Juliet: O happy dagger!
2 0.837128775958 'Live free or die', that's the New-Hampshire's motto.
3 1.0 Romeo and Juliet.
4 1.0 Did you know, New-Hampshire is in New-England.
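
The numbers in the rank list are cosine distances, that is, one minus the cosine of the angle between the query vector and a document vector. A distance of 1.0 means the two share no terms at all. A minimal NumPy sketch of the distance and the ranking, using toy vectors rather than the notebook's TF-IDF matrix (the function names are hypothetical, and both vectors are assumed to be nonzero):

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v). Assumes neither vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def rank_by_distance(query, docs):
    """Return (document index, distance) pairs, closest document first."""
    distances = [cosine_distance(query, doc) for doc in docs]
    order = np.argsort(distances)
    return [(int(i), distances[i]) for i in order]

query = [1.0, 1.0, 0.0]
docs = [[0.0, 0.0, 2.0],   # no term in common with the query
        [3.0, 3.0, 0.0],   # same direction as the query
        [1.0, 0.0, 0.0]]   # partial overlap
ranking = rank_by_distance(query, docs)
```

Because the measure depends only on direction, a document that repeats the query's terms many times is not favoured over one that uses them once.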


### Latent Semantic Analysis

In [6]:
tf_matrix = docs_tf.toarray() # D x V matrix 
A = tf_matrix.T # V x D matrix 

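# Note: np.linalg.svd returns the right singular vectors already transposed,
# so the rows (not the columns) of V are the eigenvectors of A'A.
U, s, V = np.linalg.svd(A, full_matrices=True, compute_uv=True)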


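* U: the matrix whose columns are the eigenvectors of C = AA' (the term-term matrix). Its shape is V x V, where V here denotes the vocabulary size.
* V: the matrix of the eigenvectors of B = A'A (the document-document matrix). Its shape is D x D. NumPy returns this matrix already transposed, so the eigenvectors are the rows of the returned array.
* s: the singular values in descending order, obtained as the square roots of the eigenvalues of B.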
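
These relationships can be checked numerically. The sketch below uses a small toy term-document matrix (an assumption, not the notebook's data) and names the third factor `Vt` to stress that it comes back transposed.

```python
import numpy as np

# Toy 4-term x 3-document matrix, chosen only for illustration.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild A = U * Sigma * Vt, where Sigma is the 4 x 3 matrix with s on its diagonal.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
A_rebuilt = U.dot(Sigma).dot(Vt)

# Eigenvalues of B = A'A, sorted in descending order to match s.
B = A.T.dot(A)
eigvals_B = np.sort(np.linalg.eigvalsh(B))[::-1]

# The first row of Vt should be an eigenvector of B with eigenvalue s[0]**2.
lhs = B.dot(Vt[0])
rhs = (s[0] ** 2) * Vt[0]
```

Here `A_rebuilt` matches `A`, `s ** 2` matches `eigvals_B`, and `lhs` matches `rhs`, all up to floating-point error.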


To reduce noise, LSA keeps only the K largest singular values together with the matching singular vectors. In the next cell we set K = 3, represent each term as a row of U_K diag(s_K) and each document as a row of (diag(s_K) V_K)', build the query vector by summing the vectors of its terms, and rank the documents by their cosine distance to the query in this K-dimensional space.
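
The next cell keeps only the K largest singular values. As a minimal sketch of that rank-K truncation on a toy matrix (the helper name `truncate_svd` is hypothetical), we can also check a standard property: the Frobenius norm of the approximation error equals the square root of the sum of the squared singular values that were discarded.

```python
import numpy as np

def truncate_svd(A, K):
    """Return the top-K factors of A plus the singular values that were dropped."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :K], s[:K], Vt[:K, :], s[K:]

# Toy 4-term x 3-document matrix of rank 3, chosen only for illustration.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

U_k, s_k, Vt_k, discarded = truncate_svd(A, 2)
A_k = U_k.dot(np.diag(s_k)).dot(Vt_k)        # rank-2 approximation, same shape as A
error = np.linalg.norm(A - A_k)              # Frobenius norm of the residual
expected_error = np.sqrt(np.sum(discarded ** 2))
```

A smaller K discards more detail but merges terms that tend to co-occur, which is what lets LSA relate documents that share no literal words.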

In [7]:
K = 3 # number of components

# Rank-K approximation of A (V x D matrix); kept for inspection, not used below
A_reduced = np.dot(U[:,:K], np.dot(np.diag(s[:K]), V[:K, :]))

# print A_reduced.shape


terms_rep = np.dot(U[:,:K], np.diag(s[:K])) # V x K matrix 

key_word_indices = []
for key_word in key_words:
    key_word_indices.append(vocabulary_terms.index(key_word))

docs_rep = np.dot(np.diag(s[:K]), V[:K, :]).T # D x K matrix 

             
key_words_rep = terms_rep[key_word_indices,:]     

# The query is represented by the sum of the vectors of its terms
# (here: die, dagger). The sum points in the same direction as the
# centroid, so cosine distances are the same for either choice.
query_rep = np.sum(key_words_rep, axis=0)

print query_rep



print 
print 'Query:', ' '.join(key_words)
print 
print '--------------------- Rank list ----------------------'
query_doc_cos_dist = [cosine(query_rep, doc_rep) for doc_rep in docs_rep]
query_doc_sort_index = np.argsort(np.array(query_doc_cos_dist))

for rank, sort_index in enumerate(query_doc_sort_index):
    print rank, query_doc_cos_dist[sort_index], corpus[sort_index]


Query: die dagger

--------------------- Rank list ----------------------
0 0.0479493579768 Romeo died by dagger.
1 0.195476662549 Romeo and Juliet.
2 0.195476662549 Juliet: O happy dagger!
3 0.449474164103 'Live free or die', that's the New-Hampshire's motto.
4 0.977968740393 Did you know, New-Hampshire is in New-England.
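
Note that "Romeo and Juliet." shares no word with the query, yet it now sits near the top of the list, whereas its TF-IDF distance was 1.0. The sketch below reproduces this effect on a toy matrix (not the notebook's corpus) in which terms a and b co-occur in one document and term c lives in a separate topic. The function name `lsa_query_distances` is hypothetical; it follows the same steps as the cell above.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v). Assumes nonzero vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def lsa_query_distances(A, query_term_indices, K):
    """Cosine distance from the query to each document in a K-dim latent space.

    A is a V x D term-document count matrix.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    terms_rep = U[:, :K] * s[:K]       # V x K, equals U_K diag(s_K)
    docs_rep = Vt[:K, :].T * s[:K]     # D x K, equals (diag(s_K) Vt_K).T
    query_rep = terms_rep[query_term_indices, :].sum(axis=0)
    return np.array([cosine_distance(query_rep, doc) for doc in docs_rep])

# Toy data: rows are terms a, b, c; columns are four documents.
A = np.array([[1.0, 1.0, 0.0, 0.0],    # term a: documents 0 and 1
              [0.0, 1.0, 1.0, 0.0],    # term b: documents 1 and 2
              [0.0, 0.0, 0.0, 2.0]])   # term c: document 3 only

query_terms = [0]                       # the query is the single term a
query_raw = np.array([1.0, 0.0, 0.0])
raw = np.array([cosine_distance(query_raw, A[:, j]) for j in range(A.shape[1])])
latent = lsa_query_distances(A, query_terms, 2)
```

In raw term space, document 2 (which contains only b) is at distance 1 from the query. In the two-dimensional latent space its distance is approximately 0, because a and b co-occur in document 1, while document 3 stays at distance 1.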
