Working with text
=================

This is a shortened version of the tutorial "Working with text" in the form of an interactive IPython notebook. This document is "live"; any code example can be edited and executed in the browser. To see this in action, change some part of the code in the *cell* below and then click on the play button above.

In [1]:
list_of_strings = ['working', 'with', 'text']
for s in list_of_strings:
    print(s)

working
with
text


Creating a document-term matrix
-------------------------------

Word frequencies and document-term matrices are typical units of
analysis when working with text collections. It may come as a surprise
that reducing a book to a list of word frequencies retains useful
information, but practice has shown this to be the case. Treating texts
as a list of word frequencies (a vector) also makes available a range of
mathematical tools developed for [studying and manipulating
vectors](http://en.wikipedia.org/wiki/Euclidean_vector#History).
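To make the vector view concrete, here is a minimal sketch that uses only the standard library and NumPy. The two sample sentences, the helper `bag_of_words`, and the simple `\w+` tokenizer are illustrative assumptions, not part of the tutorial's corpus or method.

```python
import re
from collections import Counter

import numpy as np

# Two toy "texts" (hypothetical examples, not from the tutorial's corpus)
texts = ["the cat sat on the mat", "the dog sat"]


def bag_of_words(text):
    """Count lowercase word tokens, ignoring word order."""
    return Counter(re.findall(r"\w+", text.lower()))


bags = [bag_of_words(t) for t in texts]

# A shared, sorted vocabulary fixes the meaning of each vector position
vocab = sorted(set().union(*bags))

# One row of counts per text; a Counter returns 0 for absent words
vectors = np.array([[bag[word] for word in vocab] for bag in bags])

# With vectors in hand, standard tools apply, e.g. Euclidean distance
distance = np.linalg.norm(vectors[0] - vectors[1])

print(vocab)
print(vectors)
print(distance)
```

Once the texts are expressed as vectors over a shared vocabulary, ordinary vector operations such as distances can be applied to them.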

> **Note**: Turning texts into unordered lists (or "bags") of words is easy in
> Python. [Python Programming for the
> Humanities](http://fbkarsdorp.github.io/python-course/) includes a
> chapter entitled [Text
> Processing](http://nbviewer.ipython.org/urls/raw.github.com/fbkarsdorp/python-course/master/Chapter%203%20-%20Text%20Preprocessing.ipynb)
> that describes the steps in detail.

This document assumes some prior exposure to text analysis, so we move
directly to gathering the word frequencies (or term frequencies) of each
text into a document-term matrix using the
[CountVectorizer](http://scikit-learn.sourceforge.net/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
class from the [scikit-learn](http://scikit-learn.sourceforge.net/)
package. (For those familiar with R and the
[tm](http://cran.r-project.org/web/packages/tm/) package, this class
performs the same operation as `DocumentTermMatrix` and takes
recognizably similar arguments.)

First we need to import the functions and classes we intend to use,
along with our customary abbreviation for functions in the `numpy`
package.

In [2]:
import numpy as np  # a conventional alias
from sklearn.feature_extraction.text import CountVectorizer

Now we use the
[CountVectorizer](http://scikit-learn.sourceforge.net/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
class to create a document-term matrix. `CountVectorizer` is highly
customizable. For example, a list of "stop words" can be specified with
the `stop_words` parameter. Other important parameters include:

-   `lowercase` (default `True`) convert all text to lowercase before
    tokenizing
-   `min_df` (default `1`) remove from the vocabulary terms that occur
    in fewer than `min_df` documents; with a large corpus this may be set
    to `5` to eliminate rare words
-   `vocabulary` ignore words that do not appear in the list (or
    iterable) assigned to parameter `vocabulary`
-   `strip_accents` remove accents
-   `token_pattern` (default `u'(?u)\b\w\w+\b'`) regular expression
    identifying tokens; by default, words that consist of a single
    character (e.g., 'a', '2') are ignored, and setting `token_pattern` to
    `'(?u)\b\w+\b'` will include these tokens
-   `tokenizer` (default unused) use a custom function for tokenizing
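The effect of several of these parameters can be seen on a small corpus. The following is a minimal sketch: the three sentences and the helper `fitted_vocabulary` are hypothetical, and the fitted `vocabulary_` attribute (a dict mapping each term to its column index) is used to list the resulting vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny, made-up corpus used only to illustrate the parameters
docs = ["A house is a home", "The house has 2 doors", "Home is home"]


def fitted_vocabulary(**params):
    """Fit a CountVectorizer with the given parameters; return sorted terms."""
    vectorizer = CountVectorizer(**params)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)


# Defaults: lowercase everything, drop single-character tokens
default_vocab = fitted_vocabulary()

# Keep single-character tokens such as 'a' and '2'
single_char_vocab = fitted_vocabulary(token_pattern=r"(?u)\b\w+\b")

# Keep only terms that occur in at least two documents
min_df_vocab = fitted_vocabulary(min_df=2)

# Drop an explicit list of stop words
stop_vocab = fitted_vocabulary(stop_words=["is", "the"])

# Preserve case, so 'Home' and 'home' become distinct terms
cased_vocab = fitted_vocabulary(lowercase=False)

for name, terms in [("default", default_vocab),
                    ("token_pattern", single_char_vocab),
                    ("min_df=2", min_df_vocab),
                    ("stop_words", stop_vocab),
                    ("lowercase=False", cased_vocab)]:
    print(name, terms)
```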

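For this example we will use texts by Jane Austen and Charlotte Brontë. These
texts are available in the *Datasets* section of the collected tutorials. The
cell below passes two short strings directly (`input='content'`) rather than
reading files, which keeps the example small.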


In [13]:
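vectorizer = CountVectorizer(input='content')
dtm = vectorizer.fit_transform(["test,123,house", "test"])  # a sparse matrix
# scikit-learn 1.0 and later provide get_feature_names_out();
# get_feature_names() was removed in scikit-learn 1.2
vocab = vectorizer.get_feature_names()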

Now we have a document-term matrix and a vocabulary list. Before we can
query the matrix and find out, for example, how many times the word
'house' occurs in *Emma* (the first text in `filenames`), we need to
convert this matrix from its current format, a [sparse
matrix](http://docs.scipy.org/doc/scipy/reference/sparse.html), into a
normal NumPy array. We will also convert `vocab`, a list of vocabulary,
to an array of strings, as an array supports a greater variety of
operations.
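The conversion described above can be sketched as follows, using the same two short strings as the cell above. The names `dtm_sparse`, `dtm_dense`, and `vocab_array` are chosen here for illustration, and the vocabulary is recovered from the fitted `vocabulary_` dict so that the sketch does not depend on which feature-name method a given scikit-learn version provides.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input='content')
dtm_sparse = vectorizer.fit_transform(["test,123,house", "test"])

# fit_transform returns a SciPy sparse matrix
print(type(dtm_sparse), sparse.issparse(dtm_sparse))

# toarray() produces a regular (dense) NumPy array with the same counts
dtm_dense = dtm_sparse.toarray()

# vocabulary_ maps each term to its column index; order terms by that index
vocab_array = np.array(sorted(vectorizer.vocabulary_,
                              key=vectorizer.vocabulary_.get))

print(dtm_dense)
print(vocab_array)
```

Converting to a dense array is convenient for small examples. For a large corpus it may be preferable to keep the sparse matrix, for the memory reasons discussed in the note on sparse matrices.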


In [14]:
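# for reference, note the current class of `dtm`
type(dtm)
# `dtm` is left as a sparse matrix in this cell;
# calling dtm.toarray() would return a regular NumPy array
vocab = np.array(vocab)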

In [15]:
vocab

array(['123', 'house', 'test'], 
      dtype='<U5')

> **Note:** A sparse matrix is a format for storing matrices in which a
> significant number of entries are zero. Typically, a sparse matrix
> records only the non-zero entries. To understand why this matters so much
> that `CountVectorizer` returns a sparse matrix by default,
> consider a 4000 by 50000 matrix that is 60% zeros. Stored as a dense
> array of 4-byte integers, the matrix needs 800 MB, and the zeros alone
> account for 480 MB of that, a significant amount of computer memory.
> A sparse matrix avoids storing the zeros, although it spends some of the
> savings on recording where the non-zero entries are. (Remember
> that arrays are usually stored in memory, not on disk.)
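The arithmetic behind this estimate can be checked directly. The sketch below assumes 4-byte (`int32`) entries and, as an illustrative assumption that the note itself does not make, a CSR layout with 4-byte indices for the sparse version. It shows that the zeros occupy 480 MB of a dense array, while the net saving is smaller once the index overhead of the sparse format is counted.

```python
rows, cols = 4000, 50000
bytes_per_int = 4      # assumes 32-bit integers
zero_percent = 60

n_entries = rows * cols
n_zero = n_entries * zero_percent // 100
n_nonzero = n_entries - n_zero

# Dense storage: every entry is stored, zero or not
dense_bytes = n_entries * bytes_per_int
zero_bytes = n_zero * bytes_per_int

# CSR storage (assumed layout): one value and one column index per
# non-zero entry, plus rows + 1 row pointers
csr_bytes = (n_nonzero * bytes_per_int
             + n_nonzero * bytes_per_int
             + (rows + 1) * bytes_per_int)

print("dense array:       ", dense_bytes / 1e6, "MB")
print("zeros in dense:    ", zero_bytes / 1e6, "MB")
print("CSR sparse matrix: ", csr_bytes / 1e6, "MB")
```

Document-term matrices for real corpora are usually far sparser than 60% zeros, so the saving tends to be much larger in practice.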

Querying the document-term matrix and the vocabulary is straightforward.
For example, here are two ways of finding how many times the word
'house' occurs in the first text, *Emma*:


In [20]:
# use the standard Python list method index(...)                    
house_idx = list(vocab).index('house')   
print(house_idx)
print(dtm[0, house_idx])                                                   
                                                                    
# alternatively, use NumPy indexing                                 
# in R this would be essentially the same, dtm[1, vocab == 'house'] 
print(dtm[0, vocab == 'house'])                                   


1
1
  (0, 0)	1


In [7]:
# verify that this is the result we anticipated
vocab[house_idx]

'house'
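A third way to find a term's column, sketched here under the assumption that the fitted vectorizer is still available, is to look the term up in its `vocabulary_` dict, which maps each term to its column index. The example rebuilds the matrix from the same two short strings and converts it to a dense array first; the names `house_count_first` and `house_count_second` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input='content')
dtm = vectorizer.fit_transform(["test,123,house", "test"]).toarray()

# vocabulary_ maps term -> column index in the document-term matrix
house_idx = vectorizer.vocabulary_['house']

# counts of 'house' in the first and second documents
house_count_first = dtm[0, house_idx]
house_count_second = dtm[1, house_idx]

print(house_idx, house_count_first, house_count_second)
```

A dict lookup avoids scanning the whole vocabulary, which may matter when the vocabulary has tens of thousands of entries.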

Sandbox
=======
Feel free to experiment with the document-term matrix `dtm` in the code cells below.

In [8]:
print(dtm.shape)
for fn in filenames:
    print(fn)

(6, 22854)
data/austen-brontë/Austen_Emma.txt
data/austen-brontë/Austen_Pride.txt
data/austen-brontë/Austen_Sense.txt
data/austen-brontë/CBronte_Jane.txt
data/austen-brontë/CBronte_Professor.txt
data/austen-brontë/CBronte_Villette.txt


In [9]:
print(len(vocab))
vocab[500:550]  # look at some of the vocabulary

22854


array(['abuse', 'abused', 'abuses', 'abusing', 'abusive', 'abyss',
       'acacia', 'acacias', 'academician', 'academicians', 'accede',
       'acceded', 'acceding', 'accelerate', 'accelerated', 'accent',
       'accented', 'accents', 'accentuated', 'accept', 'acceptable',
       'acceptably', 'acceptance', 'accepted', 'accepting', 'accepts',
       'access', 'accessible', 'accession', 'accessory', 'accident',
       'accidental', 'accidentally', 'accidently', 'accidents',
       'accommodate', 'accommodated', 'accommodating', 'accommodation',
       'accommodations', 'accompanied', 'accompanies', 'accompaniment',
       'accompaniments', 'accompany', 'accompanying', 'accompli',
       'accomplices', 'accomplish', 'accomplished'], 
      dtype='<U20')
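As a starting point for experimentation, the following sketch computes a few common summaries of a document-term matrix: tokens per document, total counts per term, document frequencies, and the most frequent terms. It uses a small made-up corpus so that it runs on its own; the sentences and variable names are hypothetical, and the same operations should apply to a dense `dtm` and a `vocab` array like those above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# A made-up corpus, used only so the sketch is self-contained
docs = ["the house by the sea", "the sea the sea the sea", "a quiet house"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs).toarray()
vocab = np.array(sorted(vectorizer.vocabulary_,
                        key=vectorizer.vocabulary_.get))

# Row sums: number of counted tokens in each document
tokens_per_doc = dtm.sum(axis=1)

# Column sums: how often each term occurs in the whole corpus
corpus_counts = dtm.sum(axis=0)

# Number of documents in which each term appears at least once
doc_freq = (dtm > 0).sum(axis=0)

# The two most frequent terms, most frequent first
top_terms = vocab[np.argsort(corpus_counts)[::-1][:2]]

print(tokens_per_doc)
print(dict(zip(vocab, corpus_counts)))
print(dict(zip(vocab, doc_freq)))
print(top_terms)
```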