## Using numpy arrays for frequency data

In this section we build a 2D array representing a textual data set. There are tools that will do what we are doing here much faster than the code you write, but the idea is for you to gain a better understanding of the actual computational representation of the data by building it yourself.

Steps

1. We loop through a set of documents, getting the vocabulary counts for each. Our final vocabulary is the union of the vocabularies of the documents, i.e., every word we've seen in the data set.
2. Let D be the number of documents and V the vocabulary size. We build a D x V matrix M representing the frequency counts of each word in each document. `M[i,j]` is the count of the `j`-th word in the `i`-th document.

The matrix M can be passed directly to a machine learning algorithm as its training matrix.
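
The two steps can be sketched on a toy corpus small enough to check by hand. The three strings in `toy_docs` are made up for illustration and stand in for the novels; the names `toy_docs`, `toy_counts`, `toy_vocab`, and `toy_M` are assumptions of this sketch and do not appear in the code in this section.

```python
import numpy as np
from collections import Counter

# Hypothetical toy corpus standing in for the novel files.
toy_docs = ['the cat sat', 'the dog sat on the cat', 'a dog']

# Step 1: count the words in each document, then take the union
# of the per-document vocabularies.
toy_counts = [Counter(doc.lower().split()) for doc in toy_docs]
toy_vocab = sorted(set().union(*toy_counts))

# Step 2: fill a D x V matrix of counts.
# A Counter returns 0 for a word it has not seen.
toy_M = np.zeros((len(toy_docs), len(toy_vocab)))
for i, ctr in enumerate(toy_counts):
    for j, word in enumerate(toy_vocab):
        toy_M[i, j] = ctr[word]

print(toy_vocab)
print(toy_M)
```

Each row of `toy_M` describes one document and each column one vocabulary word, so the row for `'the dog sat on the cat'` has a 2 in the column for `'the'`.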


In [2]:
import numpy as np
import numpy.linalg as LA
from collections import Counter

docset = ['pride_and_prejudice','northanger_abbey',
          'emma', 'mansfield_park',
          'sense_and_sensibility', 'persuasion']
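# vocab accumulates every word seen in any document;
# doc_counts holds one Counter of word counts per document.
vocab = set()
doc_counts = []
for d in docset:
    with open('austen/{0}.txt'.format(d),'r') as fh:
        # Lowercase the text and split it on whitespace to get the words.
        ctr = Counter(fh.read().lower().split())
        vocab.update(ctr.keys())
        doc_counts.append(ctr)

# Sort the vocabulary so that each word has a fixed column index.
vocab = sorted(vocab)
(D,V) = (len(doc_counts), len(vocab))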

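# M[i,j] is the count of word j in document i.
M = np.zeros((D,V))
for i in range(D):
    doc_ctr_i = doc_counts[i]
    for j in range(V):
        # A Counter returns 0 for a word that does not occur in the document.
        M[i,j] = doc_ctr_i[vocab[j]]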

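# The word count (total number of tokens) for each document.
# Counts are non-negative, so the 1-norm of a row is just its sum.
doc_sizes = LA.norm(M,ord=1,axis=1)

def divide (x,y):
    return x/y

# Divide each column of M (one entry per document) by the document sizes.
M_norm = np.apply_along_axis(divide,0,M,doc_sizes)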
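
The same normalization can also be written with NumPy broadcasting, which avoids the Python-level helper function. The sketch below is only an illustration: it uses a small made-up matrix `toy_M` rather than the Austen counts, and the names `toy_sizes`, `via_apply`, and `via_broadcast` are assumptions of the sketch. It checks that the two approaches agree and that each row of the normalized matrix sums to 1.

```python
import numpy as np

# Hypothetical 2 x 3 count matrix (2 documents, 3 vocabulary words).
toy_M = np.array([[2., 1., 1.],
                  [0., 3., 1.]])

# Row sums: the total number of tokens in each document.
toy_sizes = toy_M.sum(axis=1)

# Approach used in this section: divide every column by the document sizes.
via_apply = np.apply_along_axis(lambda col: col / toy_sizes, 0, toy_M)

# Broadcasting: reshape the sizes to a D x 1 column so that
# each row is divided by its own document size.
via_broadcast = toy_M / toy_sizes[:, np.newaxis]

print(np.allclose(via_apply, via_broadcast))
print(via_broadcast.sum(axis=1))
```

Broadcasting does the division in compiled code over the whole array at once, which tends to matter when V is in the tens of thousands.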
        

In [24]:
print sum(doc_sizes)
doc_sizes

735640.0


array([ 124588.,   80154.,  160449.,  162553.,  121590.,   86306.])

Note that `M` gives the count of a word in a document, and `M_norm` gives the proportion of the document's tokens that the word accounts for.

In [11]:
print M.shape, M_norm.shape
print docset.index('northanger_abbey'), docset[1]
print docset.index('pride_and_prejudice'),docset[0]
print vocab.index('the'),vocab[34915]

print M[1,34915], M_norm[1,34915]
print M[0,34915], M_norm[0,34915]

(6, 39548) (6, 39548)
1 northanger_abbey
0 pride_and_prejudice
34915 the
3321.0 0.0414327419717
4479.0 0.0359504928243
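
Calling `vocab.index(word)` searches the list from the start each time, which is fine for a handful of lookups. If many lookups are needed, one option is to build a dictionary from word to column number once. The sketch below uses a small made-up vocabulary and matrix; `toy_docset`, `toy_vocab`, `toy_M`, `word_to_col`, and `count_of` are hypothetical names, not part of the code above.

```python
import numpy as np

# Hypothetical toy data: 2 documents and a sorted 4-word vocabulary.
toy_docset = ['doc_a', 'doc_b']
toy_vocab = ['a', 'cat', 'sat', 'the']
toy_M = np.array([[1., 0., 1., 2.],
                  [0., 3., 1., 1.]])

# Map each word to its column number once,
# instead of searching the vocabulary list on every lookup.
word_to_col = {word: j for j, word in enumerate(toy_vocab)}

def count_of(word, doc_name):
    """Return the count of `word` in the document named `doc_name`."""
    i = toy_docset.index(doc_name)   # row index of the document
    j = word_to_col[word]            # column index of the word
    return toy_M[i, j]

print(count_of('the', 'doc_a'))
print(count_of('cat', 'doc_b'))
```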


Read the code above carefully and make sure you understand it. Then answer the following questions about `M` and `M_norm`.

1.  What is the size of Jane Austen's vocabulary?  How many words long is the Jane Austen canon (the sum of the lengths [in words] of the 6 novels)?  Has Austen passed the million word milestone?
2.  How many times does "time" occur in *Pride and Prejudice*?  Show the computations you use to compute this.
3.  How many times does "year" occur in *Sense and Sensibility*?  Show your work.
4.  Note that 'It' is not present in the vocabulary.
    ```
    >>> 'It' in vocab
    False
    ```
    Can we conclude that Jane Austen never starts a sentence with 'It'?  Why or why not?
5.  Compute an array which gives the counts for the word "gentleman" in all 6 Jane Austen novels.
6.  Which novel has the largest number of tokens of "truth"?  Show your computations.
7.  Which novel has the largest proportion of occurrences of "lady" and which the largest proportion of occurrences of "gentleman"?  Which word shows up more often in Austen's writings?
