# Week 3 Assignment

This notebook can be run using "Run All".

#### Computer Resources and Compute Time

The specifications below come from the <i>systeminfo</i> command in PowerShell. The corpus documents are stored on the laptop's SSD, so disk I/O is faster than it would be on a legacy HDD.

In [12]:
# OS Name:                   Microsoft Windows 10 Home  
# Processor(s):              1 Processor(s) Installed.  
#                            [01]: Intel64 Family 6 Model 158 Stepping 9 GenuineIntel ~2808 Mhz  
# Total Physical Memory:     16,249 MB  
# Available Physical Memory: 9,627 MB  
# Virtual Memory: Max Size:  18,681 MB  
# Virtual Memory: Available: 9,427 MB  
# Virtual Memory: In Use:    9,254 MB 

Using almost all of the corpus documents (2300 of 2307), my custom solution finished in just over 13 seconds. The scikit-learn method took just over 4 seconds.  
  
Time elapsed during Sparse Matrix creation: 13.168128 seconds  
Time elapsed during scikit-learn Sparse Matrix creation: 4.120006 seconds  

#### Class definition and Implementation

In [7]:
import nltk
import time
import numpy as np
from nltk.corpus import PlaintextCorpusReader
from scipy.sparse import csr_matrix

# Define the corpus location and document pattern
corpus_root = r"C:\Users\camer\Documents\UCI CE Courses\I&C SCI_X426.77 Text Mining and Analytics for Machine Learning\Module 2\OANC-GrAF\data\spoken\telephone\switchboard"
file_pattern = r".*/.*\.txt"

# Define the number of corpus documents to read
NUM_DOCS = 2300

class W3A:
    """
    A class to hold logic for Week 3 Assignment.
    """

    def __init__(self, root, pattern, num_docs):
        """
        Initialize the corpus reader.
        """
        self.root = root
        self.pattern = pattern
        self.num_docs = num_docs 

        # Set up PCR for our Switchboard corpus
        try:
            self.sb_pcr = PlaintextCorpusReader(self.root, self.pattern)
        except OSError:
            print("Likely the Corpus root directory is incorrect.")
        else:
            if len(self.sb_pcr.fileids()) == 0:
                print("Zero files in corpus. Check corpus path and file pattern.")
            else:
                print("PlaintextCorpusReader object created successfully.")

    def createSparseMatrix(self):
        """
        Create sparse matrix of documents and terms.
        """
        indptr = [0]
        indices = []
        data = []
        vocab = {}
        
        # Build the lists that will be used to construct the sparse matrix
        for file in self.sb_pcr.fileids()[0:self.num_docs]:
            for word in self.sb_pcr.words(file):
                index = vocab.setdefault(word, len(vocab))
                indices.append(index)
                data.append(1)
            indptr.append(len(indices))
        
        try:
            self.csr = csr_matrix((data, indices, indptr), dtype=float)
        except Exception:
            # Report the failure, then re-raise so later steps do not run without self.csr
            print('Sparse matrix creation failed.')
            raise
        
        # Get the raw count, i.e. f(t,d), by summing the duplicate entries
        self.csr.sum_duplicates()

    def createDictSumValues(self):
        """
        Create a dictionary of summed values to use for lookup.
        """
        dict_stack = {}
        for j in range(len(self.csr.data)):

            key = self.csr.indices[j]
            value = self.csr.data[j]

            if key in dict_stack:
                dict_stack[key] += value
            else:
                dict_stack[key] = value
                
        return dict_stack
    
    
# Record the starting time
started = time.time()

# Instantiate the Week 3 Assignment class with parameters set at the top of this code
week3 = W3A(corpus_root, file_pattern, NUM_DOCS)

# Create the sparse matrix as an attribute of the class
# (csr_matrix did not like to return from the function, maybe I needed to copy() it)
week3.createSparseMatrix()

# Create a copy of the csr for TF and TF-IDF results
tf_csr = week3.csr.copy()
tf_idf_csr = week3.csr.copy()

# Use the TF logic from page 63 of textbook
tf_csr.data = 1 + np.log(tf_csr.data)

# Create a dict to hold the total count of each term (column index) across all documents
dict_tf_idf = week3.createDictSumValues()

# Loop through the TF-IDF sparse matrix one stored (document, term) entry at a time
for j in range(len(tf_idf_csr.data)):
    
    # Use the TF-IDF logic from page 63 of the textbook
    tf_idf_csr.data[j] = tf_csr.data[j] * (np.log(1+(NUM_DOCS/dict_tf_idf[tf_idf_csr.indices[j]])))

elapsed = time.time() - started

# Print samples
print("\nSample of TF Sparse Matrix \n", tf_csr[0,0:10])
print("\nSample of TF-IDF Sparse Matrix \n", tf_idf_csr[0,0:10])

PlaintextCorpusReader object created successfully.

Sample of TF Sparse Matrix 
   (0, 0)	2.6094379124341005
  (0, 1)	1.6931471805599454
  (0, 2)	4.218875824868201
  (0, 3)	3.833213344056216
  (0, 4)	4.465735902799727
  (0, 5)	4.871201010907891
  (0, 6)	1.0
  (0, 7)	3.0794415416798357
  (0, 8)	4.13549421592915
  (0, 9)	1.0

Sample of TF-IDF Sparse Matrix 
   (0, 0)	0.9146215382538333
  (0, 1)	4.3730478241573545
  (0, 2)	0.2683777482650514
  (0, 3)	0.1898345072566897
  (0, 4)	0.07439399506963361
  (0, 5)	0.076499696774644
  (0, 6)	0.4640488357583941
  (0, 7)	0.3056586420608533
  (0, 8)	0.1400819670919766
  (0, 9)	1.1536720658511372
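
To make the `indptr` / `indices` / `data` bookkeeping in `createSparseMatrix` easier to follow, here is a minimal sketch of the same construction on two tiny, already-tokenized documents. The toy `docs` list and the names `counts` and `dense_counts` are my own illustration, not part of the assignment code.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two toy "documents", already tokenized (hypothetical data)
docs = [["a", "b", "a"], ["b", "c"]]

indptr, indices, data, vocab = [0], [], [], {}
for doc in docs:
    for word in doc:
        # Give each new word the next free column index
        indices.append(vocab.setdefault(word, len(vocab)))
        data.append(1)
    # Record where this document's entries end
    indptr.append(len(indices))

counts = csr_matrix((data, indices, indptr), dtype=float)
counts.sum_duplicates()  # merge repeated (row, column) pairs into raw counts

dense_counts = counts.toarray()
print(vocab)
print(dense_counts)
```

Each row is one document and each column is one vocabulary term, so the word "a" appearing twice in the first document ends up as a single entry with value 2 after `sum_duplicates()`.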


#### Compute Time

In [9]:
# Show the time taken to run the above process

print("Time elapsed during Sparse Matrix creation: {:06.6f} seconds".format(elapsed))

Time elapsed during Sparse Matrix creation: 13.168128 seconds
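
The two per-entry Python loops in the custom solution (`createDictSumValues` and the TF-IDF update) can also be written as NumPy array operations. The sketch below shows one way to do that on a toy count matrix. It keeps the same weighting as my code, ln(1 + N / total count of the term), and the names `counts`, `col_totals`, `tf` and `tf_idf` are illustrative. I have not timed this against the corpus, so any speed-up is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy raw-count matrix: 2 documents x 3 terms (hypothetical data)
counts = csr_matrix(np.array([[2.0, 1.0, 0.0],
                              [0.0, 1.0, 1.0]]))
num_docs = counts.shape[0]

# Total count of each term over all documents (what createDictSumValues builds)
col_totals = np.bincount(counts.indices, weights=counts.data,
                         minlength=counts.shape[1])

# TF for every stored entry at once: 1 + ln(count)
tf = counts.copy()
tf.data = 1 + np.log(counts.data)

# TF-IDF for every stored entry at once, looking up each entry's column total
tf_idf = counts.copy()
tf_idf.data = tf.data * np.log(1 + num_docs / col_totals[counts.indices])

print(tf_idf.toarray())
```

Indexing `col_totals` with `counts.indices` plays the role of the dictionary lookup, one value per stored entry.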


#### Extra Work: Create TF and TF-IDF matrices using scikit-learn

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

started = time.time()

# Use the defaults except read a list of file paths instead of a list of document text entries
vectorizer1 = CountVectorizer(input='filename')

# Note - we are reusing the PCR corpus object from the W3A class
freq_count = vectorizer1.fit_transform(week3.sb_pcr.abspaths()[0:NUM_DOCS])

# Set the min value at 1 and normalize the data using a natural log function
freq_count.data = 1 + np.log(freq_count.data)

# Use the defaults except read a list of file paths and skip the row normalization (norm=None)
vectorizer2 = TfidfVectorizer(input='filename', norm=None)

# Note - we are reusing the PCR corpus object from the W3A class
tf_idf = vectorizer2.fit_transform(week3.sb_pcr.abspaths()[0:NUM_DOCS])

elapsed2 = time.time() - started

# Print a (larger) sample (sklearn does not order the 2nd dimension)
print("\nSample of sklearn TF Sparse Matrix \n", freq_count[0])
print("\nSample of sklearn TF-IDF Sparse Matrix \n", tf_idf[0])


Sample of sklearn TF Sparse Matrix 
   (0, 25916)	1.0
  (0, 3162)	2.386294361119891
  (0, 10358)	1.0
  (0, 22909)	1.6931471805599454
  (0, 20081)	1.0
  (0, 14062)	1.0
  (0, 3248)	1.0
  (0, 23274)	1.0
  (0, 22390)	1.0
  (0, 5306)	1.0
  (0, 10791)	1.0
  (0, 23632)	1.0
  (0, 21924)	1.0
  (0, 4503)	1.0
  (0, 8537)	1.0
  (0, 6320)	1.0
  (0, 25342)	1.0
  (0, 8015)	1.0
  (0, 12353)	1.0
  (0, 23593)	1.0
  (0, 2010)	1.6931471805599454
  (0, 15169)	1.0
  (0, 24807)	1.0
  (0, 6633)	1.0
  (0, 14351)	1.0
  :	:
  (0, 24241)	3.8903717578961645
  (0, 11385)	3.8903717578961645
  (0, 25416)	3.0794415416798357
  (0, 25338)	3.3978952727983707
  (0, 10485)	3.302585092994046
  (0, 15925)	1.0
  (0, 15619)	3.833213344056216
  (0, 24206)	1.0
  (0, 15385)	1.6931471805599454
  (0, 6843)	3.0794415416798357
  (0, 25453)	3.4849066497880004
  (0, 860)	4.555348061489413
  (0, 25750)	2.386294361119891
  (0, 9045)	2.09861228866811
  (0, 7082)	3.1972245773362196
  (0, 25940)	4.367295829986475
  (0, 11025)	1.69314718055
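
The scikit-learn values are not expected to match my custom TF-IDF values, because `TfidfVectorizer` uses a different weighting. With its defaults plus `norm=None`, it multiplies the raw count by a smoothed IDF, ln((1 + n) / (1 + df)) + 1, where df is the number of documents containing the term. The sketch below checks that formula on two toy documents; the `docs` list and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["yeah okay yeah", "okay right"]  # hypothetical toy documents

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer(norm=None).fit_transform(docs)

n_docs = counts.shape[0]
# Document frequency: number of documents in which each term appears
df = np.bincount(counts.indices, minlength=counts.shape[1])
# Smoothed IDF (smooth_idf=True is the scikit-learn default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

manual = counts.toarray() * idf
matches = np.allclose(manual, tfidf.toarray())
print(matches)
```

Passing `sublinear_tf=True` to `TfidfVectorizer` would replace the raw count with 1 + ln(count), which is closer to the TF step I used, although the IDF part would still differ.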

#### Compute Time

In [11]:
# Show the time taken to run the above process

print("Time elapsed during scikit-learn Sparse Matrix creation: {:06.6f} seconds".format(elapsed2))

Time elapsed during scikit-learn Sparse Matrix creation: 4.120006 seconds
