## Jane Austen assignment solution

## Using numpy arrays for frequency data

In this section we build a 2D array representing a textual data set. There are tools that will do what we are doing here much faster than the code you write, but the point is to understand the actual computational representation of the data by building it yourself.

Steps

1.  We loop through a set of documents getting the vocab counts for each.  Our final vocabulary is the union of the vocabularies of the documents, i.e., every word we've seen in the data set, even if it occurred in only one document.
2. Let D be the number of documents.  Let V be the vocab size.  We build a DxV matrix M representing the frequency counts of each word in each document.  `M[i,j]` is the count of the `j`-th word in the `i`-th document.

The matrix M can be passed directly to a machine learning algorithm as its training matrix.
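The two steps can be tried on a tiny example before running them on the novels. The sketch below is only an illustration: `toy_docs`, `toy_counts`, `toy_vocab` and `toy_M` are hypothetical names, and two short strings stand in for the text files.

```python
import numpy as np
from collections import Counter

# Hypothetical toy "documents" standing in for the novels.
toy_docs = ['the cat sat', 'the dog sat on the cat']

# Step 1: per-document counts, then the union of the vocabularies.
toy_counts = [Counter(text.lower().split()) for text in toy_docs]
toy_vocab = sorted(set().union(*toy_counts))

# Step 2: the D x V count matrix. A Counter returns 0 for a word
# it has never seen, so missing words need no special handling.
toy_M = np.array([[ctr[w] for w in toy_vocab] for ctr in toy_counts])
```

Here `toy_vocab` is `['cat', 'dog', 'on', 'sat', 'the']`, and `toy_M[1, 4]` is 2 because "the" occurs twice in the second document.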


In [1]:
import numpy as np
import numpy.linalg as LA
from collections import Counter

docset = ['pride_and_prejudice','northanger_abbey',
          'emma', 'mansfield_park',
          'sense_and_sensibility', 'persuasion']
vocab = set()
doc_counts = []
for d in docset:
    with open('austen/{0}.txt'.format(d),'r') as fh:
        ctr = Counter(fh.read().lower().split())
        vocab.update(ctr.keys())
        doc_counts.append(ctr)

vocab = sorted(list(vocab))
(D,V) = (len(doc_counts), len(vocab))

M = np.zeros((D,V))
for i in range(D):
    doc_ctr_i = doc_counts[i]
    for j in range(V):
        M[i,j] = doc_ctr_i[vocab[j]]

# The word count (total tokens) for each document:
# a vector of length 6. The L1 norm is applied to each
# row, M[0,:], M[1,:], M[2,:], etc.,
# i.e., to slices along dimension 1.
doc_sizes = LA.norm(M,ord=1,axis=1)

#def divide (x,y):
#    return x/y

# Divide every word count in a doc by the size of that doc.
# Each column vector [length 6] is divided elementwise
# by doc_sizes [another vector of length 6].
# We are applying the function to columns,
# M[:,0], M[:,1], M[:,2], etc., i.e., slices along dimension 0.
M_norm = np.apply_along_axis(np.divide,0,M,doc_sizes)
        

Understanding the `apply_along_axis` function. The basic idea is to apply any function that takes a 1D array to either the rows or the columns of a matrix:

In [23]:
def my_average(a):
    """Find size of average element of a 1-D array"""
    return sum(a)/float(len(a))

# a 3x5 array.
#X = np.array([[1,2,3,4,5], [4,5,6,7,8], [7,8,9,10,11]])
X = np.array([[1,2,3,4,5], [6,7,8,9,10], [11,12,13,14,15]])
X

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])




In [46]:
print 'Axis 0, apply my_average to 5 cols of X',np.apply_along_axis(my_average, 0, X)
print 'Axis 1, apply my_average to 3 rows of X',np.apply_along_axis(my_average, 1, X)

Axis 0, apply my_average to 5 cols of X [  6.   7.   8.   9.  10.]
Axis 1, apply my_average to 3 rows of X [  3.   8.  13.]
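For a reduction as common as the mean, numpy arrays already offer an `axis` argument, which gives a quick way to check what `apply_along_axis` is doing. This is a small sketch for comparison, not part of the assignment code; `col_means` and `row_means` are names introduced here.

```python
import numpy as np

def my_average(a):
    """Average of the elements of a 1-D array."""
    return sum(a) / float(len(a))

X = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])

# axis=0 collapses the rows, leaving one value per column.
col_means = X.mean(axis=0)
# axis=1 collapses the columns, leaving one value per row.
row_means = X.mean(axis=1)

# The same results via apply_along_axis.
same_cols = np.allclose(col_means, np.apply_along_axis(my_average, 0, X))
same_rows = np.allclose(row_means, np.apply_along_axis(my_average, 1, X))
```

Both `same_cols` and `same_rows` are `True`. `apply_along_axis` remains useful when the function has no built-in `axis` argument.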


In computing `M_norm` above, we gave `apply_along_axis` one more argument, `doc_sizes`, which it passes on to `np.divide` as that function's second argument. Each column of `M` is therefore divided elementwise by the document sizes vector.
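The same normalization can also be written with numpy broadcasting. The sketch below uses a hypothetical 2-document, 3-word count matrix (`counts`, `sizes`) rather than the Austen data, and checks that the two approaches agree.

```python
import numpy as np

# Hypothetical count matrix: 2 documents, 3 vocabulary words.
counts = np.array([[2., 1., 1.],
                   [0., 3., 3.]])

# For non-negative counts the row sum equals the L1 norm of the row.
sizes = counts.sum(axis=1)

# The approach used for M_norm: divide each column by sizes.
via_apply = np.apply_along_axis(np.divide, 0, counts, sizes)

# Broadcasting: reshape sizes to a column so that it lines up with
# the rows of counts, then divide the whole matrix at once.
via_broadcast = counts / sizes[:, np.newaxis]

# Each row of the normalized matrix sums to 1.
row_totals = via_broadcast.sum(axis=1)
```

Both versions give the same matrix. The broadcasting form avoids a Python-level call per column, which may matter with a vocabulary of tens of thousands of words.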

Note that `M` gives the counts for a word in a document, and `M_norm` the proportion of the document the word occupies.

In [25]:
print M.shape, M_norm.shape
print docset.index('northanger_abbey'), docset[1]
print docset.index('pride_and_prejudice'),docset[0]
print vocab.index('the'),vocab[34915]

print M[1,34915], M_norm[1,34915]
print M[0,34915], M_norm[0,34915]

(6, 39548) (6, 39548)
1 northanger_abbey
0 pride_and_prejudice
34915 the
3321.0 0.0414327419717
4479.0 0.0359504928243


Read the code above carefully, make sure you understand it. Answer the following questions about `M` and `M_norm`.

1)  What is the size of Jane Austen's vocabulary?  How many words long is the Jane Austen canon (the sum of the lengths [in words] of the 6 novels)?

In [3]:
print 'Vocab size', len(vocab)
print 'Size of canon', sum(doc_sizes)

Vocab size 39548
Size of canon 735640.0


In [4]:
doc_sizes

array([ 124588.,   80154.,  160449.,  162553.,  121590.,   86306.])

No, she has not passed the million word milestone.

2)  How many times does "time" occur in *Pride and Prejudice*?  Show the computations you use to compute this. 

In [5]:
time_index = vocab.index('time')
pride_and_prejudice_index = docset.index('pride_and_prejudice')
print time_index
print pride_and_prejudice_index 
print 'Count of "time" in "Pride and Prejudice":',M[pride_and_prejudice_index,time_index]

35427
0
Count of "time" in "Pride and Prejudice": 136.0


3)  How many times does "year" occur in *Sense and Sensibility*?  Show your work.

In [6]:
year_index = vocab.index('year')
sense_and_sensibility_index = docset.index('sense_and_sensibility')
print year_index
print sense_and_sensibility_index
print 'Count of "year" in "Sense and Sensibility":',M[sense_and_sensibility_index,year_index]

39291
4
Count of "year" in "Sense and Sensibility": 14.0


4)  Note that 'It' is not present in the vocabulary.  Can we conclude that Jane Austen never starts a sentence with 'It'?  Why or why not?

The entire string of each novel is lower cased before being passed to the counter for that novel:

```
ctr = Counter(fh.read().lower().split())
```

So no, we cannot conclude that. Lowercasing merges `It` and `it` into the single vocabulary entry `it`, so we only know that the word occurred, not whether it was lower or upper case.
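The effect is easy to see on a short string. The sentence below is a made-up sample, not a quotation from the novels, and `raw` and `lowered` are names introduced for this sketch.

```python
from collections import Counter

# A made-up sample sentence.
sample = 'It is a truth. It is known that it is so.'

# Counting without lowercasing keeps 'It' and 'it' apart.
raw = Counter(sample.split())

# Counting after lowercasing, as the assignment code does.
lowered = Counter(sample.lower().split())
```

`raw` has 2 for `'It'` and 1 for `'it'`, while `lowered` has 3 for `'it'` and no entry for `'It'`. Note also that splitting on whitespace leaves punctuation attached, so `'truth.'` is counted as a different token from `'truth'`.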

5)  Compute an array which gives the counts for the word "gentleman" in all 6 Jane Austen novels.

In [7]:
gentleman_index = vocab.index('gentleman')
M[:,gentleman_index]

array([ 18.,  20.,  14.,  10.,  18.,  12.])

We see that one novel has 20 occurrences of `gentleman` in the count vector above.  That is the novel with index 1. To see which one that is, we do:

In [8]:
docset[1]

'northanger_abbey'

In two lines of code this is:

In [51]:
gentleman_counts = list(M[:,gentleman_index])
docset[gentleman_counts.index(max(gentleman_counts))]

'northanger_abbey'

Another equivalent solution (read up on `where` in the `numpy` docs):

In [26]:
gentleman_counts = M[:,gentleman_index]
# `where` returns a tuple of index arrays, one per dimension
max_doc_index = np.where(gentleman_counts==gentleman_counts.max())[0][0]
docset[max_doc_index]

'northanger_abbey'
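A third option is `np.argmax`, which returns the index of the largest element directly. The sketch below reuses the count vector printed above so that it runs on its own; `best_doc` is a name introduced here.

```python
import numpy as np

docset = ['pride_and_prejudice', 'northanger_abbey',
          'emma', 'mansfield_park',
          'sense_and_sensibility', 'persuasion']

# Counts for "gentleman", copied from the output of M[:,gentleman_index].
gentleman_counts = np.array([18., 20., 14., 10., 18., 12.])

# argmax gives the position of the maximum (the first one, if tied).
best_doc = docset[int(np.argmax(gentleman_counts))]
```

`best_doc` is `'northanger_abbey'`, matching the two solutions above.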

6)  Which novel has the largest number of tokens of "truth"?  Show your computations.

In [27]:
truth_index = vocab.index('truth')
truth_counts = list(M[:,truth_index])
docset[truth_counts.index(max(truth_counts))]

'emma'

7)  Which novel has the largest proportion of occurrences of "lady" and which the largest proportion of occurrences of "gentleman"?  Which word shows up more often in Austen's writings?

In [28]:
lady_index = vocab.index('lady')
lady_props = list(M_norm[:,lady_index])
print 'Greatest proportion of tokens of "lady":', docset[lady_props.index(max(lady_props))]

Greatest proportion of tokens of "lady": persuasion


In [29]:
gentleman_index = vocab.index('gentleman')
gentleman_props = list(M_norm[:,gentleman_index])
print 'Greatest proportion of tokens of "gentleman":', docset[gentleman_props.index(max(gentleman_props))]

Greatest proportion of tokens of "gentleman": northanger_abbey


In [30]:
lady_index = vocab.index('lady')
gentleman_index = vocab.index('gentleman')
total_lady_counts = sum(M[:,lady_index])
total_gentleman_counts = sum(M[:,gentleman_index])
print 'Total counts  "lady"', total_lady_counts
print 'Total counts  "gentleman"', total_gentleman_counts

Total counts  "lady" 716.0
Total counts  "gentleman" 92.0


Note: It's not even close.  Jane Austen writes about the world of women.  

An interesting fact reported to me by a student of Austen: In the entire Austen canon there are no scenes in which a woman is not present.

In [57]:
print M[:,lady_index]
print M[:,gentleman_index]

[ 166.   27.   32.  169.  122.  200.]
[ 18.  20.  14.  10.  18.  12.]


*Northanger Abbey* and *Emma* are the outliers for *lady*.
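Because the novels differ in length, it is worth checking that the outliers remain outliers after normalizing by document size. The sketch below uses only figures printed earlier (the `doc_sizes` vector and the "lady" counts); `lady_props` and `lowest_two` are names introduced here.

```python
import numpy as np

docset = ['pride_and_prejudice', 'northanger_abbey',
          'emma', 'mansfield_park',
          'sense_and_sensibility', 'persuasion']

# Figures copied from the outputs above.
doc_sizes = np.array([124588., 80154., 160449., 162553., 121590., 86306.])
lady_counts = np.array([166., 27., 32., 169., 122., 200.])

# Proportion of each novel taken up by the token "lady".
lady_props = lady_counts / doc_sizes

# The two novels with the smallest proportions.
lowest_two = sorted(docset[i] for i in np.argsort(lady_props)[:2])

# The novel with the largest proportion.
highest = docset[int(np.argmax(lady_props))]
```

`lowest_two` is `['emma', 'northanger_abbey']` and `highest` is `'persuasion'`, so the same two novels stand out whether we look at raw counts or at proportions.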

Temporal order of the novels (by date of composition):

1. Sense and Sensibility
2. Pride and Prejudice
3. Northanger Abbey
4. Mansfield Park
5. Emma
6. Persuasion
