In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Load in an English text dataset

Here I've used the text of the AMAZING book "The War of the Worlds" by H.G. Wells.  Note: to simplify things, we convert all characters to *lower case* when we load in the text.

In [2]:
# load in text dataset
text = open("war_of_the_worlds.txt").read().lower()

Let's print the first chunk of characters from this text to see what it looks like.

In [3]:
# print first 2000 characters from text
text[:2000]

"{\\rtf1\\ansi\\ansicpg1252\\cocoartf1504\\cocoasubrtf830\n{\\fonttbl\\f0\\fmodern\\fcharset0 courier;}\n{\\colortbl;\\red255\\green255\\blue255;\\red0\\green0\\blue0;}\n{\\*\\expandedcolortbl;;\\cssrgb\\c0\\c0\\c0;}\n\\margl1440\\margr1440\\vieww10800\\viewh8400\\viewkind0\n\\deftab720\n\\pard\\pardeftab720\\sl280\\partightenfactor0\n\n\\f0\\fs24 \\cf2 \\expnd0\\expndtw0\\kerning0\n\\outl0\\strokewidth0 \\strokec2 the project gutenberg ebook of the war of the worlds, by h. g. wells\\\n\\\nthis ebook is for the use of anyone anywhere at no cost and with\\\nalmost no restrictions whatsoever.  you may copy it, give it away or\\\nre-use it under the terms of the project gutenberg license included\\\nwith this ebook or online at www.gutenberg.net\\\n\\\n\\\ntitle: the war of the worlds\\\n\\\nauthor: h. g. wells\\\n\\\nrelease date: july, 1992 [ebook #36]\\\n[most recently updated october 1, 2004]\\\n\\\nlanguage: english\\\n\\\n\\\n*** start of this project gutenberg ebook the war of the 

# General pre-processing

Whoa - tons of junk here.  From what I can see we need to

- cut the first several hundred characters of junk (the RTF header and the Project Gutenberg preamble)


- remove tags, e.g., the 'newline' tag '\n'


We need to throw out all these markup tags, and probably some other nonsense characters too.  But let's start with tags.

In [4]:
# cut out first chunk of gibberish text
text = text[947:]

# remove some obvious tag-related gibberish throughout
characters_to_remove = ['0','1','2','3','4','5','6','7','8','9','_','[',']','}','.  .  .','\\']
for i in characters_to_remove:
    text = text.replace(i,'')
    
# some gibberish that looks like it needs to be replaced with a ' '
text = text.replace('\n',' ')
text = text.replace('\r',' ')
text = text.replace('--',' ')
text = text.replace(',,',' ')
text = text.replace('   ',' ')

Let's visually examine the first chunk of text now.

In [5]:
text[:2000]

"the war of the worlds  by h. g. wells   but who shall dwell in these worlds if they be  inhabited? are we or they lords of the  world? and how are all things made for man?    kepler (quoted in the anatomy of melancholy)  book one  the coming of the martians  chapter one  the eve of the war no one would have believed in the last years of the nineteenth century that this world was being watched keenly and closely by intelligences greater than man's and yet as mortal as his own; that as men busied themselves about their various concerns they were scrutinised and studied, perhaps almost as narrowly as a man with a microscope might scrutinise the transient creatures that swarm and multiply in a drop of water.  with infinite complacency men went to and fro over this globe about their little affairs, serene in their assurance of their empire over matter.  it is possible that the infusoria under the microscope do the same.  no one gave a thought to the older worlds of space as sources of huma

All right - that looks much better.  If we so desire we can do even more pre-processing: remove unwanted punctuation, perhaps strange or rare words, etc.
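As one possible next step, the sketch below strips a chosen set of punctuation and collapses every run of whitespace to a single space.  The helper name `extra_clean` and the punctuation set are assumptions for illustration, not part of the cells above.  Note that the `replace('   ',' ')` call in the cleaning cell only rewrites triples of spaces in a single pass, which is why double spaces survive in the printed output.

```python
import re

def extra_clean(raw, drop='!"():;?'):
    """Remove the characters in `drop`, then collapse whitespace runs."""
    # str.maketrans with a third argument maps each listed character to None
    table = str.maketrans('', '', drop)
    stripped = raw.translate(table)
    # \s+ matches any run of spaces, tabs, or newlines
    return re.sub(r'\s+', ' ', stripped).strip()

# hypothetical sample resembling the cleaned text
sample = 'the  world? and how   are all things made for man?    kepler (quoted)'
print(extra_clean(sample))
```

Whether to drop punctuation at all depends on the model: a character-level generator may want to keep it, while a word-level one usually does not need it.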

# Splitting the text into distinct words

With our text pre-processed we can now convert each unique word of the text into a unique number, so that we can then e.g., create a Markov model for word-based text generation.

First - we use the [CountVectorizer](http://scikit-learn.org/stable/modules/feature_extraction.html) functionality from scikit-learn to split the text into its distinct words.  We can then find all of the unique words in the text, assign a number to each unique word, and then convert all words of the text into a corresponding sequence of numbers.  We'll do this one step at a time below.

Below we create a ``CountVectorizer``, fit it to the text, and compute the text's unique words.

In [6]:
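# create scikit-learn's CountVectorizer and fit it to the text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([text])
analyze = vectorizer.build_analyzer()

# split the text into tokens and get all unique words in the input corpus
tokens = analyze(text)
# get_feature_names_out() replaces get_feature_names(), which recent
# scikit-learn releases removed; on old versions use get_feature_names()
unique_words = vectorizer.get_feature_names_out().tolist()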

How many total words, unique words, and characters does the text contain?

In [7]:
# total number of words, unique words, and characters in text
print ('this text dataset has ' + str(len(tokens)) + ' total words in it')
print ('this text dataset has ' + str(len(unique_words)) + ' unique words in it')
print ('this text dataset has ' + str(len(text)) + ' characters in it')

this text dataset has 57640 total words in it
this text dataset has 6717 unique words in it
this text dataset has 338645 characters in it


Let's visually examine the first chunk of unique words from the text.

In [8]:
# how many unique words are in the text?
print ('there are ' + str(len(unique_words)) + ' unique words in this text')

# examine the first 20 unique words
unique_words[:20]

there are 6717 unique words in this text


['abandoned',
 'abandoning',
 'abart',
 'abbey',
 'abbreviated',
 'abiding',
 'ability',
 'ablaze',
 'able',
 'aboard',
 'abounded',
 'about',
 'above',
 'abroad',
 'abrupt',
 'abruptly',
 'absence',
 'absent',
 'absolute',
 'absolutely']

Print out the first few 'tokens' - this is jargon for the text broken into individual words.  A 'token' is just a word.

In [9]:
# print out the first chunk of tokens from the text
tokens[:20]

['the',
 'war',
 'of',
 'the',
 'worlds',
 'by',
 'wells',
 'but',
 'who',
 'shall',
 'dwell',
 'in',
 'these',
 'worlds',
 'if',
 'they',
 'be',
 'inhabited',
 'are',
 'we']
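
Notice that the initials in 'h. g. wells' do not appear among the tokens.  By default `CountVectorizer` tokenizes with the regular expression `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters, so one-letter words such as 'a' and 'i' are dropped as well.  The sketch below reproduces that behavior with the standard `re` module on a short made-up sample; it illustrates the default pattern rather than replacing the analyzer used above.

```python
import re

# default token_pattern of CountVectorizer: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# hypothetical sample text, lower-cased like the corpus
sample = "the war of the worlds  by h. g. wells   i saw a martian"
sample_tokens = TOKEN_PATTERN.findall(sample.lower())
print(sample_tokens)
```

If one-letter words matter for your model, one option is to pass a different `token_pattern` (for example `r"(?u)\b\w+\b"`) when constructing the vectorizer.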

These look fine - so now we can make a unique number per unique word.  It naturally makes sense to use a ``dictionary`` for this sort of thing.  Here we'll make two dictionaries - one that converts the unique words to unique numbers, and another that converts the unique numbers back into unique words.

In [10]:
# unique nums to map words to
unique_nums = np.arange(len(unique_words))

# this dictionary is a function mapping each unique word to a unique integer
words_to_keys = dict((i, n) for (i,n) in zip(unique_words,unique_nums))

# this dictionary is a function mapping each unique integer to a unique word
keys_to_words = dict((i, n) for (i,n) in zip(unique_nums,unique_words))

We can make sure that the two dictionaries mirror each other by looking up a number in one and then the resulting word in the other.

In [11]:
keys_to_words[120]

'albeit'

In [12]:
words_to_keys['albeit']

120
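
The spot check above covers a single entry.  A fuller check, sketched here on a small stand-in vocabulary (the list `vocab` is hypothetical), builds both dictionaries with `enumerate` and asserts that every word survives the round trip.  Using `enumerate` yields plain Python integers rather than NumPy integers; that is a design choice of this sketch, not what the cell above does.

```python
# hypothetical stand-in for the sorted unique words
vocab = ['abandoned', 'albeit', 'martians', 'war', 'worlds']

# word -> integer and integer -> word lookups
word_to_key = {word: key for key, word in enumerate(vocab)}
key_to_word = {key: word for key, word in enumerate(vocab)}

# every word should map to a key that maps back to the same word
for word in vocab:
    assert key_to_word[word_to_key[word]] == word

print(word_to_key['albeit'], key_to_word[1])
```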

All right - makes sense.  With our word-to-number conversion dictionary created we can now shove the entire text through it and convert it to a sequence of numbers.  In this notebook we call the sequence of numbers that represents a sequence of words the ``keys``.

In [13]:
# convert all of our tokens (words) to keys
keys = [words_to_keys[a] for a in tokens]

Let's print out the first chunk of tokens, and then shove the keys back through our converter to check that the keys do map back to each word properly.

In [14]:
# print first chunk of words (tokens)
print(tokens[:20])

['the', 'war', 'of', 'the', 'worlds', 'by', 'wells', 'but', 'who', 'shall', 'dwell', 'in', 'these', 'worlds', 'if', 'they', 'be', 'inhabited', 'are', 'we']


In [15]:
# confirm that first chunk of keys maps correctly back to words 
print([keys_to_words[keys[i]] for i in range(20)])

['the', 'war', 'of', 'the', 'worlds', 'by', 'wells', 'but', 'who', 'shall', 'dwell', 'in', 'these', 'worlds', 'if', 'they', 'be', 'inhabited', 'are', 'we']


In [16]:
# print first chunk of keys
print ([keys[i] for i in range(20)])

[5910, 6457, 3926, 5910, 6659, 716, 6528, 711, 6578, 5119, 1738, 2928, 5928, 6659, 2870, 5929, 409, 2995, 240, 6499]


Yes!  Seems to make sense.
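
Since the stated goal is something like a Markov model, here is a minimal sketch of how a key sequence could feed one: for each key it counts which keys follow it.  The toy sequence `toy_keys` and the helper `transition_counts` are assumptions for illustration; a real model would be built from the full `keys` list and would likely normalize the counts into probabilities.

```python
from collections import Counter, defaultdict

def transition_counts(key_sequence):
    """Count how often each key is followed by each other key (order 1)."""
    counts = defaultdict(Counter)
    # pair every key with its successor
    for current, following in zip(key_sequence, key_sequence[1:]):
        counts[current][following] += 1
    return counts

# hypothetical toy sequence: 0='the', 1='war', 2='of', 3='worlds'
toy_keys = [0, 1, 2, 0, 3, 0, 1]
counts = transition_counts(toy_keys)
print(counts[0].most_common())
```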

# Splitting the text into distinct characters

Big picture

- Make sure to pre-process in the same way as in the word-level version above.


- With the pre-processed text we just need to create dictionaries to convert each unique character to a unique number, and each number back to its unique character.
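
The two bullets above can be sketched on a toy string before applying them to the full text.  The names `build_char_maps`, `encode`, and `decode` are hypothetical helpers, not functions defined elsewhere in this notebook.

```python
def build_char_maps(sample_text):
    """Return (char -> key, key -> char) dictionaries for a string."""
    chars = sorted(set(sample_text))
    char_to_key = {ch: k for k, ch in enumerate(chars)}
    key_to_char = {k: ch for k, ch in enumerate(chars)}
    return char_to_key, key_to_char

def encode(sample_text, char_to_key):
    """Convert a string into a list of integer keys."""
    return [char_to_key[ch] for ch in sample_text]

def decode(sample_keys, key_to_char):
    """Convert a list of integer keys back into a string."""
    return ''.join(key_to_char[k] for k in sample_keys)

toy = 'the war of the worlds'
c2k, k2c = build_char_maps(toy)
toy_char_keys = encode(toy, c2k)

# the full round trip should reproduce the input exactly
print(decode(toy_char_keys, k2c) == toy)
```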

First let's print out the number of characters, the number of unique characters, and the unique characters themselves.

In [17]:
# collect the unique characters in the text, in sorted order
unique_chars = sorted(set(text))

# print some of the text, as well as statistics
print ("this corpus has " +  str(len(text)) + " total number of characters")
print ("this corpus has " +  str(len(unique_chars)) + " unique characters")
print (unique_chars)

this corpus has 338645 total number of characters
this corpus has 38 unique characters
[' ', '!', '"', "'", '(', ')', ',', '-', '.', ':', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


Convert characters to integers

In [18]:
# unique number range to map characters to
unique_nums = np.arange(len(unique_chars))

# this dictionary is a function mapping each unique character to a unique integer
chars_to_keys = dict((i, n) for (i,n) in zip(unique_chars,unique_nums))

# this dictionary is a function mapping each unique integer to a unique character
keys_to_chars = dict((i, n) for (i,n) in zip(unique_nums,unique_chars))

Let's test out our dictionary converters to make sure they work properly.

In [19]:
# convert all of our characters to keys
keys = [chars_to_keys[a] for a in text]

In [20]:
keys[:20]

[31, 19, 16, 0, 34, 12, 29, 0, 26, 17, 0, 31, 19, 16, 0, 34, 26, 29, 23, 15]

In [21]:
# print the first chunk of characters directly from the text, for comparison
print([text[i] for i in range(21)])

['t', 'h', 'e', ' ', 'w', 'a', 'r', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'w', 'o', 'r', 'l', 'd', 's']


In [22]:
# confirm that first chunk of keys maps correctly back to characters
print([keys_to_chars[keys[i]] for i in range(21)])

['t', 'h', 'e', ' ', 'w', 'a', 'r', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'w', 'o', 'r', 'l', 'd', 's']


Everything looks good.