# General Goals
The purpose of this notebook is to first perform covariance PCA on N novels, with the additional option of splitting any given novel into a series of "chunks" of W words each. Gibbs sampling is then used to estimate a probability density for each novel from the PCA coordinates of its chunks. Novels can be clustered empirically from the overlap of these probability densities. Special attention will be paid to how the chunk size affects the probability density for each novel.

NOTE: This notebook is older than the PCA-variance notebook. (What is written inline here is defined as functions there.)



# Notebook Status: 

## LOOP 0 (PREPROCESSING / TOKENIZING / CHUNKING):
NOTE: The notebook needs to be in the same folder as the documents to process.   
  - WHOLE TEXTS (i.e., chunk_corpus = 0): OK
  - CHUNKED TEXTS (i.e., chunk_corpus = 1): unverified (2/1/2017)
  

## LOOP 1 (PCA of whole and chunks):
  - PCA for whole text
  - PCA for N chunks
 
## LOOP 2 (GIBBS SAMPLING ON TEXTS):
  - Broken

In [1]:
#FOR PCA
import os
from os import walk, getcwd, listdir
import numpy as np
import nltk  #for tokenizer
import pandas as pd
from sklearn.decomposition import PCA as sklearnPCA
import matplotlib.pyplot as plt
import matplotlib #to set rcParams to change graph size
import time

#FOR GIBBS
from random import randint
import scipy.stats
#import math as m

%matplotlib inline

In [2]:
## PARAMETERS (Those not yet implemented in code are commented out)
np.random.seed(1) #seed random number generator so "random" results are repeatable

## Chunking Parameters ----------------------
chunk_corpus = 1            # 0 = don't chunk texts; 1 = do chunk texts
chunk_size_used = 5000    # number of words per chunk (the final chunk of each text may be shorter)
#chunk_size_interval = 100 # interval of chunking when doing experiments with multiple chunk sizes; 
#number_of_chunk_size_experiments = 10  # set equal to chunk_size_used if you only want to do experiment for one chunk size

## Word Count Parameters --------------------
#number_of_MFW_window_experiments = 1 #If set equal to 1, just uses "Number_of_MFWs_used"
#Number_of_MFWs_used = 25  #number of MFWs used to plot PCA graphs
#MFW_interval = 100        # interval increase of MFW window used when doing experiments w/...
                           # multiple MFW windows

In [3]:
## FUNCTIONS USED TO GET WORD FREQUENCIES FROM TEXTS
def count_words(list_to_search): #uses single_type_count() to count each token
    unique_words = set(list_to_search)
    word_counts = {}
    for word in unique_words:
        word_counts[word] = single_type_count(word, list_to_search)
    return word_counts # dict w/ word counts

def single_type_count(token_to_count, list_to_search): #counts up all tokens of a type
    number_of_tokens = 0                            
    for token in list_to_search:                   
        if token == token_to_count:                 
            number_of_tokens += 1                   
    return number_of_tokens #returns int 

def total_number_of_words(dict_of_word_counts): #for use with token_counts
    number_of_words = 0
    for word in dict_of_word_counts:
        number_of_words = number_of_words + dict_of_word_counts[word]
    return number_of_words #returns int

def total_number_of_words_in_corpus(list_of_total_word_counts):
    total_number_of_words = 0
    for total in range(0,len(list_of_total_word_counts)):
        total_number_of_words = total_number_of_words + list_of_total_word_counts[total]
    return total_number_of_words #returns int

def get_word_frequencies(dict_of_words_with_counts, total_number_of_words_in_text):
    word_freq = {}
    for word in dict_of_words_with_counts:
        word_freq[word] = dict_of_words_with_counts[word]/total_number_of_words_in_text
    return word_freq # dict with word w/ normalized frequencies

## FUNCTION TO CHUNK A TEXT OF ANY SIZE
# Note that "wordlist" needs to be a list of words *in the order* of the original text;
# also, chunk_text returns the number of chunks used for a particular text in addition
# to a list of text chunks
def chunk_text(wordlist, words_per_chunk): #chunk text; only the last chunk may have < words_per_chunk words
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be a positive integer") #callers unpack two values, so a bare return would fail later
    number_of_chunks, size_of_final_chunk = divmod(len(wordlist), words_per_chunk)
    if size_of_final_chunk > 0:
        number_of_chunks = number_of_chunks + 1 #we add one because divmod's quotient does not include a partial final chunk
    list_of_text_chunks = [wordlist[words_per_chunk*chunk:words_per_chunk*(chunk+1)] for chunk in range(0, number_of_chunks)]
    return list_of_text_chunks, number_of_chunks

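The counting helpers above rescan the whole token list once per unique word, which becomes slow for full novels. As a hedged alternative (not what the notebook currently runs), the same counts and relative frequencies can be obtained in a single pass with `collections.Counter`. The names `count_words_fast` and `relative_frequencies` are hypothetical and exist only for this sketch.

```python
from collections import Counter

def count_words_fast(tokens):
    """Return a dict mapping each token type to its count (one pass over tokens)."""
    return dict(Counter(tokens))

def relative_frequencies(counts, total):
    """Divide each count by a caller-supplied total (e.g., the corpus word count)."""
    return {word: n / total for word, n in counts.items()}

# Toy token list standing in for one tokenized text.
tokens = ["the", "cat", "and", "the", "hat"]
counts = count_words_fast(tokens)
freqs = relative_frequencies(counts, sum(counts.values()))
```

Passing the corpus-wide total instead of the per-text total reproduces the normalization the notebook uses later on.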
In [4]:
## READ ALL DOCUMENTS IN DIRECTORY AS STRINGS IN A LIST
alltexts = [ filename for filename in listdir(getcwd()) if filename.endswith('.txt')] #get filenames in current directory ending in ".txt"
wordlist = [] #each element is the word list for alltexts[document]
novelnames = []  #each element is the name of the text

for document in range(0, len(alltexts)):                   # Loop through all files in directory
    root, ext = os.path.splitext(alltexts[document])       # Split the filename into its name ("root") and extension ("ext")
    if (ext == '.txt'):    
        with open(alltexts[document], "r") as text_file:   #read file; "with" closes it when done
            text = text_file.read()
        novelnames.append(root)                            #create list of text names
        
## TOKENIZE WORDS FOR ALL DOCUMENTS, i.e., [[text1_words],[text2_words],...]
        text = text.lower()
        wordlist.append(nltk.tokenize.word_tokenize(text))     #tokenize file (a poor tokenizer)
        #wordlist.append(text.lower().split())                 #tokenize file (a terrible tokenizer)  

In [5]:
## BREAK UP TOKENIZED TEXTS INTO CHUNKS
# The following code produces two lists: (1) a list of word chunks for the entire corpus
# (e.g., [[word_chunk_list_1], [word_chunk_list_2],...]) 
# AND (2) a list of the starting and ending chunks associated with each text, where the 
# index number for each text is the same as the text in "novelnames" list
# (e.g., [[starting_chunk_#_for_text_1 , ending_chunk_#_for_text_1],...])

document_chunk_index = [] # a list such that document_chunk_index[0][0] is the index 
                          # for the 1st chunk of the 1st document and document_chunk_index[0][1]
                          # is the index for the final chunk of the 1st document (inclusive); will
                          # remain empty if no chunking is performed (i.e., if chunk_corpus = 0)

wordlist_of_chunks = []  # a list of chunk wordlists for entire corpus
  
if chunk_corpus == 1: #turn text chunking on or off in parameters section at beginning of code
    for document in range(0, len(alltexts)):        
            wordlists_for_all_chunks_in_a_text, number_of_chunks = chunk_text(wordlist[document], chunk_size_used)
            wordlist_of_chunks = wordlist_of_chunks + wordlists_for_all_chunks_in_a_text #.concatenate(wordlists_for_all_chunks_in_a_text) #add chunks to corpus list of chunks
            if document == 0:
                document_chunk_index = [[0, number_of_chunks-1]]
            else:
                document_chunk_index.append([document_chunk_index[document-1][1]+1,document_chunk_index[document-1][1]+number_of_chunks])
                

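The bookkeeping in this cell can be illustrated on a toy corpus. The sketch below is an assumption-laden restatement, not the notebook's own code: `chunk_tokens` and `build_chunk_index` are hypothetical names, and the index stores an inclusive `[first, last]` pair per document, matching the convention described in the comments above.

```python
def chunk_tokens(tokens, words_per_chunk):
    """Split tokens into consecutive chunks; only the last chunk may be shorter."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    return [tokens[i:i + words_per_chunk]
            for i in range(0, len(tokens), words_per_chunk)]

def build_chunk_index(token_lists, words_per_chunk):
    """Return (all_chunks, index), where index[d] = [first, last] is inclusive."""
    all_chunks, index = [], []
    for tokens in token_lists:
        first = len(all_chunks)
        all_chunks.extend(chunk_tokens(tokens, words_per_chunk))
        index.append([first, len(all_chunks) - 1])
    return all_chunks, index

# Two toy "documents" of 7 and 4 one-letter words, chunked 3 words at a time.
toy_corpus = [list("abcdefg"), list("hijk")]
chunks, index = build_chunk_index(toy_corpus, 3)

# Because the end index is inclusive, slicing needs last + 1 to keep the final chunk.
first, last = index[0]
doc0_chunks = chunks[first:last + 1]
```

Any later slice over a document's chunks has to add 1 to the stored end index, otherwise the final (usually shorter) chunk is silently dropped.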

In [None]:
##TESTS FOR WORD CHUNKING USING 3 WOOLF NOVELS
#print("wordlist_of_chunks")                
#print(wordlist_of_chunks)
#print("length of wordlist of chunks")
#print(len(wordlist_of_chunks))
#print("Number of words in first chunk")
#print(len(wordlist_of_chunks[0]))
#print("document_chunk_index")
#print(document_chunk_index)
#print("16, 56, 89")
#print(str(len(wordlist_of_chunks[16]))+ ", "+str(len(wordlist_of_chunks[56]))+", "+ str(len(wordlist_of_chunks[89])))

In [None]:
## CALCULATE WORD COUNTS, i.e., [{text1_word_counts}, {text2_word_counts},...]
t0 = time.perf_counter() #grab the time; this is used to track how long word counting takes (time.clock() was removed in Python 3.8)

# FIRST, CALCULATE WORD COUNTS FOR INDIVIDUAL TEXTS IN CORPUS
word_counts = []
total_word_counts = []
for text in range(0, len(wordlist)):
    word_counts.append(count_words(wordlist[text]))
    total_word_counts.append(total_number_of_words(word_counts[text]))

# We also need to calculate total words in corpus
corpus_word_count = total_number_of_words_in_corpus(total_word_counts)
t1 = time.perf_counter() #grab the time again once the whole texts have been counted

# SECOND, IF CHUNKING TEXTS, CALCULATE WORD COUNTS FOR INDIVIDUAL CHUNKS BY REPEATING THE ABOVE PROCEDURE. 
word_counts_for_chunks = []
total_word_counts_for_chunks = [] 
if chunk_corpus == 1:
    for chunk in range(0, len(wordlist_of_chunks)):
        word_counts_for_chunks.append(count_words(wordlist_of_chunks[chunk]))
        total_word_counts_for_chunks.append(total_number_of_words(word_counts_for_chunks[chunk]))

# We also need to calculate total words in corpus for chunks
corpus_word_count_for_chunks = 0 # Initialized to zero so the variable exists even when no chunking is performed
if chunk_corpus == 1:
    corpus_word_count_for_chunks = total_number_of_words_in_corpus(total_word_counts_for_chunks)
t2 = time.perf_counter() #grab the time again once the chunks have been counted

#NOTE: WORD COUNT FOR INDIVIDUAL TEXTS AND CHUNKS SHOULD BE THE SAME!!
if chunk_corpus ==1:
    if corpus_word_count != corpus_word_count_for_chunks:
        print("FATAL ERROR: Total word counts for texts and chunks are not equal!")

In [None]:
##TESTS FOR WORD COUNTING USING 3 WOOLF NOVELS
print("corpus_word_count: "+ str(corpus_word_count))                
print("corpus_word_count_for_chunks: " + str(corpus_word_count_for_chunks))
# Output time it takes to run code block
print("Time to count words of texts:"+ str((t1-t0)/60.0)+" minute(s)") 
print("Time to count words of chunks:"+ str((t2-t1)/60.0)+" minute(s)") 

In [None]:
## CALCULATE WORD FREQUENCIES (RELATIVE TO ENTIRE CORPUS)

# First, for individual texts...
word_frequencies = []
for text_word_counts in range(0, len(word_counts)):
    word_frequencies.append(get_word_frequencies(word_counts[text_word_counts],corpus_word_count))

#Second, for individual chunks...
word_frequencies_for_chunks = []
if chunk_corpus == 1: 
    for chunk_word_counts in range(0, len(word_counts_for_chunks)):
            word_frequencies_for_chunks.append(get_word_frequencies(word_counts_for_chunks[chunk_word_counts],corpus_word_count_for_chunks))


In [None]:
## FIRST, EXAMINE MFWs (OR LFWs) IN A READABLE FORMAT USING PANDAS DATAFRAMES FOR INDIVIDUAL TEXTS IN CORPUS
readable_word_frequencies = pd.DataFrame(word_frequencies).T
text_to_examine = 0 # column identifier for a particular text; full list in "novelnames"
MFW = readable_word_frequencies.sort_values([text_to_examine], ascending = False)
MFW = MFW.fillna(0)
#print(novelnames) #uncomment to display list of texts
#MFW.head(25) #Display first 25 words for MFW
#MFW #uncomment to display MFW relative to a single text; WARNING: this list is millions of lines

In [None]:
## SECOND, EXAMINE MFWs (OR LFWs) FOR INDIVIDUAL CHUNKS (IF chunk_corpus == 1)
if chunk_corpus == 1:
    readable_word_frequencies_for_chunks = pd.DataFrame(word_frequencies_for_chunks).T
    chunk_to_examine = 0 # column identifier for a particular chunk; chunk ranges for each text are in "document_chunk_index"
    MFW_for_chunks = readable_word_frequencies_for_chunks.sort_values([chunk_to_examine], ascending = False) #sort by the word frequencies of chunk number chunk_to_examine
    MFW_for_chunks = MFW_for_chunks.fillna(0)
    #FOR TESTING
    #print(novelnames) #uncomment to display list of texts
    #print(document_chunk_index) #uncomment to display index for chunks for each text
    #print(MFW_for_chunks.head(25)) #Display first 25 MFWords for all novels in comparison to novelnames[text_to_examine]
    #MFW #uncomment to display MFW relative to a single text; WARNING: this list may be millions of lines

In [None]:
## TRANSFORM WORD FREQUENCIES LIST FOR PCA
# To do PCA with scikit-learn, we need a matrix where each word (i.e., each dimension) 
# gets its own nested list in which each text is an element in that word list
print("corpus word count:" + str(corpus_word_count))
Number_of_MFWs_used = 444564 #used in next line to pull N MFWs from MFW dataframe
MFWlist_array_for_PCA = (MFW.head(Number_of_MFWs_used).to_numpy()).T #convert from dataframe to numpy array (for texts); as_matrix() was removed in pandas 1.0
if chunk_corpus == 1: #MFW_for_chunks only exists when the texts were chunked
    MFWlist_array_for_PCA_for_chunks = (MFW_for_chunks.head(Number_of_MFWs_used).to_numpy()).T #convert from dataframe to numpy array (for chunks)

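The reshaping described in this cell is easier to follow on a small example. The sketch below uses made-up frequencies (the values and the names `toy_frequencies`, `table`, `mfw`, and `features` are all hypothetical) to show how a list of per-text frequency dicts becomes the texts-by-words array that scikit-learn's PCA expects.

```python
import pandas as pd

# Toy relative frequencies: one dict per text (hypothetical values).
toy_frequencies = [
    {"the": 0.05, "and": 0.03, "whale": 0.01},
    {"the": 0.04, "and": 0.02, "lighthouse": 0.02},
]

# Rows = words, columns = texts; a word absent from a text becomes NaN, then 0.
table = pd.DataFrame(toy_frequencies).T.fillna(0)

# Sort by the first text's frequencies and keep its top N words.
top_n = 3
mfw = table.sort_values(by=0, ascending=False).head(top_n)

# Transpose so that each row is a text (a sample) and each column is a word (a feature).
features = mfw.to_numpy().T
```

Here `features` has shape (2, 3): two texts described by the three most frequent words of the first text.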
In [None]:
## FIRST, PERFORM PCA AND GENERATE PCA GRAPH FOR ENTIRE TEXTS -------------------------
## ------------------------------------------------------------------------------------

## (0) We want to autogenerate colors; one for each text and/or chunk. To do this, we'll
## take the built in dictionary in matplotlib. Sadly, we can't just iterate through 
## this list because it will pick colors that all look very similar so
## we choose values at random from this list, being sure to use no color twice... 

#Generate color scheme for text plots by picking colors randomly
graph_colors = list(matplotlib.colors.cnames.items()) #graph_colors[i][0] returns the i-th defined color name
colors_for_texts = [] 
numbers_used = []
for text in range(0, len(alltexts)):
    iterate = 1
    while iterate == 1:
        random_int = randint(0, len(graph_colors)-1) #randint includes both endpoints, so stop at the last valid index
        if random_int not in numbers_used:
            numbers_used = numbers_used + [random_int]
            colors_for_texts = colors_for_texts + [graph_colors[random_int][0]]
            iterate = 0
            
## (1) generate data points of PCA from MFW word list (this is where PCA is performed)
pca = sklearnPCA(n_components=2)
pca_coordinates = pca.fit_transform(MFWlist_array_for_PCA) #array of x- & y-coordinates 
PCA_coordinates_for_texts = pca_coordinates # will be gibbs sampled 

## (2) plot PCA data
for text in range(0, len(PCA_coordinates_for_texts)):
    plt.plot(pca_coordinates[text,0],pca_coordinates[text,1], 
        'o', markersize=7, color=colors_for_texts[text], alpha=0.5, label=novelnames[text])
#plt.plot(sklearn_pca_coordinates[0,0], sklearn_pca_coordinates[0,1],
#        'o', markersize=7, color=colors_for_texts[0], alpha=0.5, label=novelnames[0])
#plt.plot(sklearn_pca_coordinates[1,0], sklearn_pca_coordinates[1,1],
#         '^', markersize=7, color=colors_for_texts[1], alpha=0.5, label=novelnames[1])
#plt.plot(sklearn_pca_coordinates[2,0], sklearn_pca_coordinates[2,1],
#         '^', markersize=7, color=colors_for_texts[2], alpha=0.5, label=novelnames[2])

## (3) Select Graph Display Parameters
plt.xlabel('PC 1 ('+str(pca.explained_variance_ratio_[0]*100)+'%)') #x-axis title
plt.ylabel('PC 2('+str(pca.explained_variance_ratio_[1]*100)+'%)') #y-axis title
matplotlib.rcParams['figure.figsize'] = (12.0,12.0) #size of graph generated in notebook
#plt.xlim(x_min, x_max)
#plt.ylim(y_min, y_max)
#plt.axis('tight') #OR just fit plot around data automatically; but this usually fits *so* closely that it misses data
plt.legend()  #generate legend
plt.title('PCA for ' + str(len(novelnames)) + ' novels') #title of plot
ax = plt.subplot(111) #used in making legend

##PLACE LEGEND OUTSIDE OF PLOT
# Shrink current axis's height by 10% on the bottom
box = ax.get_position()
ax.set_position([box.x0, box.y0 + box.height * 0.1,
                 box.width, box.height * 0.9])

# Put a legend below current axis
ax.legend(loc='lower center', bbox_to_anchor=(0.5, -.15),
          fancybox=True, shadow=False, ncol=5)

# Add gridlines 
plt.grid(True, which='major', color='gray', linestyle='dotted') #the "b" keyword was renamed in newer matplotlib; passing True positionally works in both

#Plot Graph!
plt.show()

## PERFORM PCA AND GENERATE PCA GRAPH FOR ALL TEXT CHUNKS  ------------------
# (1) generate data points of PCA from MFW word list (this is where PCA is performed)
pca = sklearnPCA(n_components=2)
pca_coordinates = pca.fit_transform(MFWlist_array_for_PCA_for_chunks) #array of x- & y-coordinates 
pca_coordinates_for_chunks = pca_coordinates # will be gibbs sampled

# (2) plot PCA data
for text in range(0, len(novelnames)):
    first_chunk = document_chunk_index[text][0]
    last_chunk = document_chunk_index[text][1] + 1 #stored end index is inclusive, so add 1 when slicing
    plt.plot(pca_coordinates[first_chunk:last_chunk,0], 
        pca_coordinates[first_chunk:last_chunk,1],
        'o', markersize=7, color=colors_for_texts[text], alpha=0.5, label=novelnames[text])

# (3) select Graph Parameters
plt.xlabel('PC 1 ('+str(pca.explained_variance_ratio_[0]*100)+'%)') #x-axis title
plt.ylabel('PC 2('+str(pca.explained_variance_ratio_[1]*100)+'%)') #y-axis title
matplotlib.rcParams['figure.figsize'] = (12.0,12.0) #size of graph generated in notebook
#plt.xlim(x_min, x_max)
#plt.ylim(y_min, y_max)
#plt.axis('tight') #OR just fit plot around data automatically; but this usually fits *so* closely that it excludes data points
plt.legend()  #generate legend
plt.title('PCA for ' + str(len(novelnames)) + ' novels in chunks of '+str(chunk_size_used)+' words') #title of plot
ax = plt.subplot(111) #used in making legend

##PLACE LEGEND OUTSIDE OF PLOT
# Shrink current axis's height by 10% on the bottom
box = ax.get_position()
ax.set_position([box.x0, box.y0 + box.height * 0.1,
                 box.width, box.height * 0.9])

# Put a legend below current axis
ax.legend(loc='lower center', bbox_to_anchor=(0.5, -.15),
          fancybox=True, shadow=False, ncol=5)

# Add gridlines 
plt.grid(True, which='major', color='gray', linestyle='dotted') #the "b" keyword was renamed in newer matplotlib; passing True positionally works in both

#plot graph
plt.show()

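The color-selection step in the cell above (draw random indices, reject repeats) can also be written as sampling without replacement. This is a minimal sketch under that assumption, not the notebook's own routine; `pick_distinct_colors` is a hypothetical helper, and it draws from the same `matplotlib.colors.cnames` dictionary.

```python
import random
import matplotlib.colors as mcolors

def pick_distinct_colors(n, seed=None):
    """Return n distinct named matplotlib colors, chosen at random."""
    names = sorted(mcolors.cnames)  # sorted so that a fixed seed is repeatable
    if n > len(names):
        raise ValueError("more colors requested than named colors available")
    return random.Random(seed).sample(names, n)

# One color per text; here we pretend the corpus holds five texts.
colors = pick_distinct_colors(5, seed=1)
```

Sampling without replacement cannot loop forever, and it raises a clear error when there are more texts than named colors.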
In [None]:
#TESTS
print(len(document_chunk_index))
print(document_chunk_index)

In [None]:
## NOW WE WANT TO DO GIBBS SAMPLING ON THE RESULTING PCA DATA FOR CLUSTERS
##------------------------------------------------------------------------

## INITIAL PARAMETERS USING PCA CLUSTER DATA
#  Note: we can't generate a probability distribution from a point; so the Gibbs sampler
#  is sampling chunks only. That said, we can use the average of the chunks for each text
#  to obtain the prior for mu prior. 

## Definitions ------------------------------------------------------------------------------------

#  mu: the coordinates for the prior of the centers of the Gaussian probability densities; 
#      our prior (i.e., our initial "guess") is that this is the average of the chunk coordinates 
#      for a particular text
## Pick prior Mu_i for all texts 
mu = [] 
for text in range(0, len(novelnames)):
    first_chunk = document_chunk_index[text][0]
    last_chunk = document_chunk_index[text][1] + 1 #stored end index is inclusive, so add 1 when slicing
    mu.append([np.mean(pca_coordinates_for_chunks[first_chunk:last_chunk,0]), 
             np.mean(pca_coordinates_for_chunks[first_chunk:last_chunk,1])])
mu = np.array(mu)

#  sigma: the variance for each text (i.e., each cluster); we assume that the variance will be
#      proportional to the number of chunks in the text such that sigma ~ number_of_chunks/2.
#      This is only a rough initial guess.

## Pick sigmas for all texts
#  We "guess" that the variance will be proportional to the number of chunks in the text 
#  such that sigma ~ number_of_chunks/2
sigma = []
for text in range(0, len(novelnames)):
    sigma.append([(document_chunk_index[text][1]+1-document_chunk_index[text][0])/2])
sigma = np.array(sigma)

##  K: number of clusters (i.e., number of texts)
K = len(novelnames) #Number of clusters (i.e., number of texts)

##  Z_i: text assignment for each chunk (we know the true assignments, but we do not assume them; they are to be found with Gibbs sampling)
Z_i = np.random.randint(0, K, size=(len(pca_coordinates_for_chunks),1)) #generate random Z_i assignments

## number of iterations to run gibbs sampler
N_iterations = 15000 #number of iterations that will be performed for each cluster of chunks (i.e., each text)  

## the initial "guess" for the variance of the Gaussian from which mu is drawn
lambda_ = np.array([1.0 for i in range(K)]).reshape(K,-1)
#mu = scipy.stats.multivariate_normal.rvs(cov=np.diagflat([lambda_hat[k], lambda_hat[k]]), size=K) #THese are guesses

## Total Number of Chunks
number_of_chunks = len(pca_coordinates_for_chunks) #use the lowercase name defined in the PCA cell

#TEST 1
print("Mu_k")
print(mu)
print("Sigma_k")
print(sigma)
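
The sampler itself is not shown in this notebook (LOOP 2 is marked as broken), so the following is only a sketch of what one Gibbs sweep could look like under simplifying assumptions: isotropic Gaussian clusters with fixed variances, equal mixing weights, and a normal prior on each mean whose variance plays the role of `lambda_`. The function `gibbs_sweep` and all of its argument names are hypothetical, and the toy points stand in for `pca_coordinates_for_chunks`.

```python
import numpy as np

def gibbs_sweep(points, z, mu, var, prior_mean, prior_var, rng):
    """One Gibbs sweep: resample assignments z, then the cluster means mu."""
    n_points, dim = points.shape
    n_clusters = mu.shape[0]
    # (1) Resample each assignment from p(z_i = k | x_i, mu, var).
    sq_dist = ((points[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    log_p = -0.5 * sq_dist / var[None, :] - 0.5 * dim * np.log(var)[None, :]
    log_p -= log_p.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    p = np.exp(log_p)
    p /= p.sum(axis=1, keepdims=True)
    for i in range(n_points):
        z[i] = rng.choice(n_clusters, p=p[i])
    # (2) Resample each mean from its conjugate normal posterior.
    for k in range(n_clusters):
        members = points[z == k]
        precision = 1.0 / prior_var + len(members) / var[k]
        mean = (prior_mean[k] / prior_var + members.sum(axis=0) / var[k]) / precision
        mu[k] = rng.normal(mean, np.sqrt(1.0 / precision))
    return z, mu

# Toy data: two well-separated 2-D clusters of 20 points each.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal([-5.0, 0.0], 0.5, size=(20, 2)),
                    rng.normal([5.0, 0.0], 0.5, size=(20, 2))])
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])  # initial guesses for the means
prior_mean = mu.copy()
var = np.array([1.0, 1.0])                # fixed cluster variances (an assumption)
z = rng.integers(0, 2, size=len(points))  # random initial assignments
for _ in range(50):
    z, mu = gibbs_sweep(points, z, mu, var, prior_mean, prior_var=100.0, rng=rng)
```

In the notebook's setting, the priors for the means would come from the per-text chunk averages in `mu`, and a full sampler would also record `z` and `mu` at every iteration rather than keeping only the last sweep.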