# An introduction to computational text analysis: A more in-depth look at strings

In [1]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Strings have some simple but powerful methods that allow us to begin working with text in more complex ways. You saw how to import a .csv as a `datascience` Table in the last notebook, but what happens when we want to import text that is not nicely organized into rows and columns? We can organize the important parts into a nice tabular structure by first identifying parts that we want. 

**NOTE:** We will use hand-typed and plain text (.txt) file examples in this notebook, and since the rest of the class will focus on HathiTrust Research Center resources, **know that .html, .json, and .xml file formats are important to computational text analysis but will not be covered in this class.**

# Challenge 1

1. Store your first name in a variable named `first`.  
2. Store your last name in a variable named `last`.  
3. Convert `first` to all upper case letters.  
4. Convert `last` to all lower case letters.  
5. Combine these two string variables into a variable named `full`.  
6. Slice out `first` from `full`.  
7. Slice out `last` from `full`.  
8. Slice `full` so that it contains only the last 2 characters of your first name and the first two characters of your last name.  
9. Which string methods did you use in #3 and #4 above? How do you know they are string methods?

In [14]:
## YOUR CODE HERE
first = "Evan"
print(first)
last = "Muzzall"
print(last)

print(first.upper())
print(last.lower())

full = first + " " + last
print(full)

Evan
Muzzall
EVAN
muzzall
Evan Muzzall


In [20]:
full[:4]
#print(full[-7:])
print(full[5:])

Muzzall


In [22]:
middle = full[2:7]
print(middle)

an Mu


#### Bonus
What do the following commands return? Why? 

"cat" > "category"  
"cat " * 5  
"cat" + 2  

In [28]:
## YOUR CODE HERE
"cat" + "2"

'cat2'

# Jorge Luis Borges

![borges](img/borges_1921.jpg)

Below is a string of [Jorge Luis Borges'](https://en.wikipedia.org/wiki/Jorge_Luis_Borges) poem "On His Blindness":

In [36]:
borges = """In the fullness of the years, like it or not,
a luminous mist surrounds me, unvarying, 
that breaks things down into a single thing,
colorless, formless. Almost into a thought. 
The elemental, vast night and the day
teeming with people have become that fog
of constant, tentative light that does not flag,
and lies in wait at dawn. I longed to see
just once a human face. Unknown to me
the closed encyclopedia, the sweet play
in volumes I can do no more than hold, 
the tiny soaring birds, the moons of gold.
Others have the world, for better or worse; 
I have this half-dark, and the toil of verse."""

In [37]:
print(borges)

In the fullness of the years, like it or not,
a luminous mist surrounds me, unvarying, 
that breaks things down into a single thing,
colorless, formless. Almost into a thought. 
The elemental, vast night and the day
teeming with people have become that fog
of constant, tentative light that does not flag,
and lies in wait at dawn. I longed to see
just once a human face. Unknown to me
the closed encyclopedia, the sweet play
in volumes I can do no more than hold, 
the tiny soaring birds, the moons of gold.
Others have the world, for better or worse; 
I have this half-dark, and the toil of verse.


In [38]:
# make an unpreprocessed copy for use below
borges_dirty = borges

# Tokenization
Tokenization is the process of splitting text into words. Each occurrence of a word in the text is called a "token", and each distinct word is called a "type". A word such as "the" is therefore a single type that may occur as many tokens within a text.
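
A small hand-typed example may make the token/type distinction concrete. The sentence and the variable names `sample`, `sample_tokens`, and `sample_types` are made up for illustration and are not part of the poem used below.

```python
# Hypothetical example sentence (not from the poem)
sample = "the cat saw the dog"

sample_tokens = sample.split()     # every occurrence counts as a token
sample_types = set(sample_tokens)  # each distinct word counts once as a type

print(len(sample_tokens))  # 5 tokens
print(len(sample_types))   # 4 types ("the" appears twice)
```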

`.split` allows us to split the text based on some sort of separator. In this case, we want to split on the "whitespace" (the blank spaces between words). 

**NOTE:** remember to use your help files in the form of `help(borges.split)`

Let's just look at the first six words. 

In [39]:
borges.split()[:6]

['In', 'the', 'fullness', 'of', 'the', 'years,']

In [51]:
# challenge 2
borges.split(".")#[:6]

['In the fullness of the years, like it or not,\na luminous mist surrounds me, unvarying, \nthat breaks things down into a single thing,\ncolorless, formless',
 ' Almost into a thought',
 ' \nThe elemental, vast night and the day\nteeming with people have become that fog\nof constant, tentative light that does not flag,\nand lies in wait at dawn',
 ' I longed to see\njust once a human face',
 ' Unknown to me\nthe closed encyclopedia, the sweet play\nin volumes I can do no more than hold, \nthe tiny soaring birds, the moons of gold',
 '\nOthers have the world, for better or worse; \nI have this half-dark, and the toil of verse',
 '']

How many characters are there in `borges`?

In [40]:
print(len(borges))

599


How many words?

In [41]:
print(len(borges.split()))

110


How many lines? (hint: a line break is represented as \n)

In [42]:
print(len(borges.split("\n")))

14


How many stanzas?

In [43]:
print(len(borges.split("\n\n")))

1


At which index does the word "me" first appear?

In [44]:
print(borges.find("me")) # .find is "forward search"

72


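At which index does the string "me" last appear? Note that `.rfind` matches substrings, not whole words: the result below, 433, falls inside "volumes" rather than at the standalone word "me".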

In [45]:
print(borges.rfind("me")) # .rfind starts at the highest index and works in reverse

433


In [46]:
help(borges.rfind)

Help on built-in function rfind:

rfind(...) method of builtins.str instance
    S.rfind(sub[, start[, end]]) -> int
    
    Return the highest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.
    
    Return -1 on failure.
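
Because `.find` and `.rfind` search for substrings rather than whole words, it can help to compare them with a word-level lookup. The following is only a sketch on a made-up string (`phrase` is a hypothetical example, not part of the poem):

```python
phrase = "become me"  # hypothetical example string

# .find matches substrings, so it finds the "me" inside "become" first
char_index = phrase.find("me")
print(char_index)  # 4

# splitting first lets us ask for the position of the whole word instead
word_index = phrase.split().index("me")
print(word_index)  # 1 (the second word)

# .find returns -1 when the substring is absent
missing = phrase.find("xyz")
print(missing)  # -1
```

The first result is a character position and the second is a word position, so the two numbers answer different questions.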



How many unique words? (hint, use `set`!)

In [47]:
len(set(borges.lower().split()))

82

In [48]:
print(set(borges.lower().split()))

{'light', 'me', 'years,', 'closed', 'can', 'others', 'better', 'does', 'elemental,', 'in', 'thing,', 'to', 'see', 'dawn.', 'human', 'colorless,', 'day', 'flag,', 'wait', 'just', 'or', 'unvarying,', 'single', 'lies', 'with', 'gold.', 'half-dark,', 'play', 'soaring', 'no', 'fog', 'luminous', 'night', 'i', 'breaks', 'thought.', 'more', 'and', 'mist', 'have', 'sweet', 'people', 'almost', 'the', 'surrounds', 'teeming', 'constant,', 'not', 'of', 'not,', 'fullness', 'longed', 'face.', 'birds,', 'tentative', 'verse.', 'for', 'volumes', 'unknown', 'tiny', 'a', 'me,', 'that', 'like', 'down', 'vast', 'at', 'once', 'encyclopedia,', 'than', 'worse;', 'this', 'become', 'toil', 'hold,', 'into', 'do', 'formless.', 'things', 'it', 'world,', 'moons'}


# Challenge 2
1. Using your intuition, how might you split text on commas?
2. On periods?
3. How do you split _all_ of `borges` on whitespace so that all words are split and printed?

In [None]:
## YOUR CODE HERE

# Some fast notes about for-loops, functions, and conditionals in Python

Custom functions, for-loops, and conditionals are important tools that you will want to eventually explore. Since we will use a for-loop to remove punctuation and a list comprehension (a compact form of looping with accumulation) to remove stop words below, let's take a minute to talk about these important topics.

### Custom functions

Just like built-in functions, custom functions take some inputs and give you back desired output(s). We can define our own custom functions to use over and over again. 

In the example below, `def` tells Python that we want to define our own function. `sq_and_div` is the name of the function, and it needs two arguments to work: `x` and `y`.

The colon symbol `:` tells Python that the code to be evaluated comes on the indented line after it.  

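`return` tells Python to hand the value of the expression after it back to the code that called the function. In a notebook, that value is displayed when the function call is the last line of a cell.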

In [63]:
def sq_and_div(x,y):
    return (x**2)/y

In [65]:
sq_and_div(10,4)

25.0
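
To illustrate what `return` does, here is a short sketch that stores the returned value instead of displaying it. The names `result` and `doubled` are chosen here for illustration only.

```python
def sq_and_div(x, y):
    # same function as above: square x, then divide by y
    return (x ** 2) / y

result = sq_and_div(10, 4)  # the returned value is stored; nothing is printed
doubled = result * 2        # so it can be reused in later calculations

print(result)   # 25.0
print(doubled)  # 50.0
```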

### For loops

For loops are useful when you want to use the same code over a range of values, data, or files. 

`for` tells Python that we want to write a for loop. 

`x` is our loop variable (a placeholder that takes each value in turn), and `range(1, 13)` supplies the values 1 through 12, one per iteration. The colon symbol `:` again tells Python that the code to be evaluated follows.

In [66]:
for x in range(1, 13):
    print("The time is", x, "o'clock")

The time is 1 o'clock
The time is 2 o'clock
The time is 3 o'clock
The time is 4 o'clock
The time is 5 o'clock
The time is 6 o'clock
The time is 7 o'clock
The time is 8 o'clock
The time is 9 o'clock
The time is 10 o'clock
The time is 11 o'clock
The time is 12 o'clock
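
"Looping with accumulation" means building up a result one iteration at a time. As a minimal sketch (the variable names are made up for illustration), here is the same list built with a for loop and with a list comprehension:

```python
# Loop with accumulation: start with an empty list and append to it
squares_loop = []
for x in range(1, 6):
    squares_loop.append(x ** 2)

# The same result written as a list comprehension
squares_comp = [x ** 2 for x in range(1, 6)]

print(squares_loop)  # [1, 4, 9, 16, 25]
print(squares_comp)  # [1, 4, 9, 16, 25]
```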


### Conditionals

Conditionals are statements that let your code do different things depending on whether a condition is true. In the case below, `if` tells Python "if this condition is met, do _this_".

`else` then tells Python "otherwise, do something _else!_"

In [69]:
x = int(input("What time is it (PM)?"))
if x < 9:
    print("The time is", x, "o'clock")
else:
    print("It's getting late!")

What time is it (PM)?6
The time is 6 o'clock
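
When there is more than one condition to check, Python also offers `elif` ("else if"). The sketch below wraps the idea in a hypothetical function, `describe_hour`, so that it can be run without typing input; the extra cutoff at 11 is an assumption added for illustration.

```python
def describe_hour(x):
    # hypothetical helper; the cutoff at 9 mirrors the cell above
    if x < 9:
        return "The time is " + str(x) + " o'clock"
    elif x < 11:
        # only checked when the first condition was False
        return "It's getting late!"
    else:
        # runs when none of the conditions above were met
        return "Time for bed!"

print(describe_hour(6))   # The time is 6 o'clock
print(describe_hour(9))   # It's getting late!
print(describe_hour(11))  # Time for bed!
```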


# Removing punctuation

Remember how we imported that nice string of English punctuation in the first cell of this notebook? We could manually remove all of the punctuation using the `.replace` method, but this would get old fast!

In [72]:
print(type(punctuation))
print(punctuation)

<class 'str'>
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [73]:
borges_periods = borges.replace(".", " ")

In [74]:
print(borges_periods) # all periods have been replaced with spaces

In the fullness of the years, like it or not,
a luminous mist surrounds me, unvarying, 
that breaks things down into a single thing,
colorless, formless  Almost into a thought  
The elemental, vast night and the day
teeming with people have become that fog
of constant, tentative light that does not flag,
and lies in wait at dawn  I longed to see
just once a human face  Unknown to me
the closed encyclopedia, the sweet play
in volumes I can do no more than hold, 
the tiny soaring birds, the moons of gold 
Others have the world, for better or worse; 
I have this half-dark, and the toil of verse 


But what if you have tons of text and don't know exactly what punctuation is present? A quick for loop can help us remove all the punctuation from `borges`, i.e. every character in !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~

In [77]:
for char in punctuation:
    borges = borges.replace(char, "")

In [78]:
print(borges)

In the fullness of the years like it or not
a luminous mist surrounds me unvarying 
that breaks things down into a single thing
colorless formless Almost into a thought 
The elemental vast night and the day
teeming with people have become that fog
of constant tentative light that does not flag
and lies in wait at dawn I longed to see
just once a human face Unknown to me
the closed encyclopedia the sweet play
in volumes I can do no more than hold 
the tiny soaring birds the moons of gold
Others have the world for better or worse 
I have this halfdark and the toil of verse
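
As an alternative sketch (not the approach used in this notebook), Python's built-in `str.maketrans` and `str.translate` can delete every punctuation character in a single pass. The variable names `line`, `table`, and `clean_line` are chosen here for illustration.

```python
from string import punctuation

line = "colorless, formless. Almost into a thought."  # one line of the poem

# str.maketrans("", "", chars) builds a table that deletes every char in chars
table = str.maketrans("", "", punctuation)
clean_line = line.translate(table)

print(clean_line)  # colorless formless Almost into a thought
```

Like the loop, this also deletes hyphens, so "half-dark" would become "halfdark".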


# Challenge 3

Describe what is happening in this punctuation-removal loop:

In [None]:
for char in punctuation:
    borges = borges.replace(char, "")

# Tokenization with the `nltk` library

The [`nltk` (natural language toolkit)](https://nltk.readthedocs.io/en/latest/) library can also help you tokenize your text.

In [79]:
#!pip install nltk
from nltk.tokenize import word_tokenize

tokens = word_tokenize(borges)
tokens

['In',
 'the',
 'fullness',
 'of',
 'the',
 'years',
 'like',
 'it',
 'or',
 'not',
 'a',
 'luminous',
 'mist',
 'surrounds',
 'me',
 'unvarying',
 'that',
 'breaks',
 'things',
 'down',
 'into',
 'a',
 'single',
 'thing',
 'colorless',
 'formless',
 'Almost',
 'into',
 'a',
 'thought',
 'The',
 'elemental',
 'vast',
 'night',
 'and',
 'the',
 'day',
 'teeming',
 'with',
 'people',
 'have',
 'become',
 'that',
 'fog',
 'of',
 'constant',
 'tentative',
 'light',
 'that',
 'does',
 'not',
 'flag',
 'and',
 'lies',
 'in',
 'wait',
 'at',
 'dawn',
 'I',
 'longed',
 'to',
 'see',
 'just',
 'once',
 'a',
 'human',
 'face',
 'Unknown',
 'to',
 'me',
 'the',
 'closed',
 'encyclopedia',
 'the',
 'sweet',
 'play',
 'in',
 'volumes',
 'I',
 'can',
 'do',
 'no',
 'more',
 'than',
 'hold',
 'the',
 'tiny',
 'soaring',
 'birds',
 'the',
 'moons',
 'of',
 'gold',
 'Others',
 'have',
 'the',
 'world',
 'for',
 'better',
 'or',
 'worse',
 'I',
 'have',
 'this',
 'halfdark',
 'and',
 'the',
 'toil',
 'o

In [84]:
tokens = borges.split()
tokens

['In',
 'the',
 'fullness',
 'of',
 'the',
 'years',
 'like',
 'it',
 'or',
 'not',
 'a',
 'luminous',
 'mist',
 'surrounds',
 'me',
 'unvarying',
 'that',
 'breaks',
 'things',
 'down',
 'into',
 'a',
 'single',
 'thing',
 'colorless',
 'formless',
 'Almost',
 'into',
 'a',
 'thought',
 'The',
 'elemental',
 'vast',
 'night',
 'and',
 'the',
 'day',
 'teeming',
 'with',
 'people',
 'have',
 'become',
 'that',
 'fog',
 'of',
 'constant',
 'tentative',
 'light',
 'that',
 'does',
 'not',
 'flag',
 'and',
 'lies',
 'in',
 'wait',
 'at',
 'dawn',
 'I',
 'longed',
 'to',
 'see',
 'just',
 'once',
 'a',
 'human',
 'face',
 'Unknown',
 'to',
 'me',
 'the',
 'closed',
 'encyclopedia',
 'the',
 'sweet',
 'play',
 'in',
 'volumes',
 'I',
 'can',
 'do',
 'no',
 'more',
 'than',
 'hold',
 'the',
 'tiny',
 'soaring',
 'birds',
 'the',
 'moons',
 'of',
 'gold',
 'Others',
 'have',
 'the',
 'world',
 'for',
 'better',
 'or',
 'worse',
 'I',
 'have',
 'this',
 'halfdark',
 'and',
 'the',
 'toil',
 'o

# Sentence segmentation

Sentence segmentation deals with identifying sentence boundaries. We can do this by splitting on punctuation:

In [None]:
borges_dirty

In [80]:
borges_dirty.split(".")

['In the fullness of the years, like it or not,\na luminous mist surrounds me, unvarying, \nthat breaks things down into a single thing,\ncolorless, formless',
 ' Almost into a thought',
 ' \nThe elemental, vast night and the day\nteeming with people have become that fog\nof constant, tentative light that does not flag,\nand lies in wait at dawn',
 ' I longed to see\njust once a human face',
 ' Unknown to me\nthe closed encyclopedia, the sweet play\nin volumes I can do no more than hold, \nthe tiny soaring birds, the moons of gold',
 '\nOthers have the world, for better or worse; \nI have this half-dark, and the toil of verse',
 '']

# `nltk` can do sentence segmentation also!

In [81]:
from nltk.tokenize import sent_tokenize
sent_tokenize(borges_dirty)

['In the fullness of the years, like it or not,\na luminous mist surrounds me, unvarying, \nthat breaks things down into a single thing,\ncolorless, formless.',
 'Almost into a thought.',
 'The elemental, vast night and the day\nteeming with people have become that fog\nof constant, tentative light that does not flag,\nand lies in wait at dawn.',
 'I longed to see\njust once a human face.',
 'Unknown to me\nthe closed encyclopedia, the sweet play\nin volumes I can do no more than hold, \nthe tiny soaring birds, the moons of gold.',
 'Others have the world, for better or worse; \nI have this half-dark, and the toil of verse.']

We do all of this preprocessing in the hope of "normalizing" our text. There are many ways to normalize text, but some common steps include:
- case folding (dealing with upper and lower case letters; generally, we want to make all text lower-case).
- removing URLs, digits, and hashtags
- infrequent word removal
- stop word removal

Regular expressions help out greatly with these! See me during office hours if you have further questions. 
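
As a rough sketch of how regular expressions can handle a few of these steps, the example below uses Python's built-in `re` module on a made-up string. The text in `raw` and the specific patterns are assumptions for illustration; real data will usually need more careful patterns.

```python
import re

# Hypothetical messy text, made up for illustration
raw = "Borges wrote 14 LINES here: https://example.com #poetry"

text = raw.lower()                         # case folding
text = re.sub(r"https?://\S+", " ", text)  # remove URLs
text = re.sub(r"#\w+", " ", text)          # remove hashtags
text = re.sub(r"\d+", " ", text)           # remove digits
text = re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(text)  # borges wrote lines here:
```

The URL pattern runs before the digit pattern so that a URL containing digits is removed as a single unit.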

# Count word frequencies

We can use the `Counter` class from Python's built-in `collections` module to count words! Let's look at the ten most frequent:

In [87]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)

[('the', 9),
 ('of', 4),
 ('a', 4),
 ('that', 3),
 ('and', 3),
 ('have', 3),
 ('I', 3),
 ('or', 2),
 ('not', 2),
 ('me', 2)]

# Removing stop words

Yikes! The most common words in `borges` seem to be [stop words](https://en.wikipedia.org/wiki/Stop_words) such as "the", "of", and "a". Let's remove them because they are rarely useful in computational text analysis. 

In [106]:
import nltk
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/evan.admin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/evan.admin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/evan.admin/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [88]:
from nltk.corpus import stopwords
stop = stopwords.words("english")
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [95]:
# note: `no_stops` is created in cell In [93], shown just below
freq2 = Counter(no_stops)
freq2.most_common(20)

[('I', 3),
 ('In', 1),
 ('fullness', 1),
 ('years', 1),
 ('like', 1),
 ('luminous', 1),
 ('mist', 1),
 ('surrounds', 1),
 ('unvarying', 1),
 ('breaks', 1),
 ('things', 1),
 ('single', 1),
 ('thing', 1),
 ('colorless', 1),
 ('formless', 1),
 ('Almost', 1),
 ('thought', 1),
 ('The', 1),
 ('elemental', 1),
 ('vast', 1)]

In [93]:
# reuse the `stop` list loaded above instead of reloading it for every word
no_stops = [word for word in tokens if word not in stop]
no_stops

['In',
 'fullness',
 'years',
 'like',
 'luminous',
 'mist',
 'surrounds',
 'unvarying',
 'breaks',
 'things',
 'single',
 'thing',
 'colorless',
 'formless',
 'Almost',
 'thought',
 'The',
 'elemental',
 'vast',
 'night',
 'day',
 'teeming',
 'people',
 'become',
 'fog',
 'constant',
 'tentative',
 'light',
 'flag',
 'lies',
 'wait',
 'dawn',
 'I',
 'longed',
 'see',
 'human',
 'face',
 'Unknown',
 'closed',
 'encyclopedia',
 'sweet',
 'play',
 'volumes',
 'I',
 'hold',
 'tiny',
 'soaring',
 'birds',
 'moons',
 'gold',
 'Others',
 'world',
 'better',
 'worse',
 'I',
 'halfdark',
 'toil',
 'verse']
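
Notice that "In", "The", and "I" survive in the output above: the stop list is all lower case and the comparison is case-sensitive. One possible fix, sketched here with a small hand-typed stand-in for the NLTK list (`small_stop` and `sample_tokens` are assumptions, used so the example runs on its own), is to compare the lower-cased form of each word:

```python
# Small hand-typed stand-in for the NLTK stop list (an assumption, to keep
# this sketch self-contained); all four words appear in the NLTK list above
small_stop = {"i", "in", "the", "of"}

sample_tokens = ["In", "the", "fullness", "of", "the", "years", "I", "longed"]

# case-sensitive filter: "In" and "I" slip through
case_sensitive = [w for w in sample_tokens if w not in small_stop]

# compare the lower-cased form instead, keeping the original spelling
case_folded = [w for w in sample_tokens if w.lower() not in small_stop]

print(case_sensitive)  # ['In', 'fullness', 'years', 'I', 'longed']
print(case_folded)     # ['fullness', 'years', 'longed']
```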

# Stemming/lemmatization

Both stemming and lemmatization seek to reduce words to a base form by removing or normalizing morphological affixes.

If we stem the word "eats" we get "eat". If we stem the word "sleeping" we get "sleep". We stem words because we tend to focus more on the meaning of the core content of the word, rather than its tense. 

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions.

Let's try a few! 

In [96]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [97]:
stemmer.stem("eats")

'eat'

In [98]:
stemmer.stem("sleeping")

'sleep'

In [99]:
stemmer.stem("flying") # uh oh... "fli" is not a real word

'fli'
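
To get a feel for why a rule-based stemmer resembles "a bunch of regular expressions", here is a toy sketch with just two suffix rules. The function `toy_stem` is hypothetical and is **not** the Porter algorithm, which uses many more rules and conditions.

```python
import re

def toy_stem(word):
    # Hypothetical two-rule stemmer for illustration; NOT the Porter algorithm
    word = re.sub(r"ing$", "", word)  # strip a trailing "ing"
    word = re.sub(r"s$", "", word)    # strip a trailing "s"
    return word

print(toy_stem("eats"))      # eat
print(toy_stem("sleeping"))  # sleep
print(toy_stem("running"))   # runn  (real stemmers need many more rules)
```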

In [None]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
snowballer_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [None]:
print(snowballer_stemmer.stem("eats"))
print(snowballer_stemmer.stem("sleeping"))

In [None]:
print(lemmatizer.lemmatize("leaves")) # uh-oh...

# Part of speech tagging

Part of speech (POS) tagging assigns each token a part of speech (e.g., noun, verb, adjective, etc.).

Again, there are many different [alternatives](https://github.com/nltk/nltk/tree/develop/nltk/tag), but NLTK keeps its recommended POS tagger available through the function `pos_tag`. The tagger expects a list of tokens as input. When doing POS tagging, it is advisable **not** to remove stop words beforehand (although you are free to do it afterwards).

In [100]:
borges

'In the fullness of the years like it or not\na luminous mist surrounds me unvarying \nthat breaks things down into a single thing\ncolorless formless Almost into a thought \nThe elemental vast night and the day\nteeming with people have become that fog\nof constant tentative light that does not flag\nand lies in wait at dawn I longed to see\njust once a human face Unknown to me\nthe closed encyclopedia the sweet play\nin volumes I can do no more than hold \nthe tiny soaring birds the moons of gold\nOthers have the world for better or worse \nI have this halfdark and the toil of verse'

Whoops! We forgot to remove our line breaks. Let's do so now:

In [101]:
borges = borges.replace("\n", " ")

In [102]:
borges # looking good! :) 

'In the fullness of the years like it or not a luminous mist surrounds me unvarying  that breaks things down into a single thing colorless formless Almost into a thought  The elemental vast night and the day teeming with people have become that fog of constant tentative light that does not flag and lies in wait at dawn I longed to see just once a human face Unknown to me the closed encyclopedia the sweet play in volumes I can do no more than hold  the tiny soaring birds the moons of gold Others have the world for better or worse  I have this halfdark and the toil of verse'

In [103]:
from nltk import pos_tag
pos_borges = borges
pos_borges

'In the fullness of the years like it or not a luminous mist surrounds me unvarying  that breaks things down into a single thing colorless formless Almost into a thought  The elemental vast night and the day teeming with people have become that fog of constant tentative light that does not flag and lies in wait at dawn I longed to see just once a human face Unknown to me the closed encyclopedia the sweet play in volumes I can do no more than hold  the tiny soaring birds the moons of gold Others have the world for better or worse  I have this halfdark and the toil of verse'

In [104]:
tagged_borges = pos_tag(tokens)
tagged_borges

[('In', 'IN'),
 ('the', 'DT'),
 ('fullness', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('years', 'NNS'),
 ('like', 'IN'),
 ('it', 'PRP'),
 ('or', 'CC'),
 ('not', 'RB'),
 ('a', 'DT'),
 ('luminous', 'JJ'),
 ('mist', 'NN'),
 ('surrounds', 'NNS'),
 ('me', 'PRP'),
 ('unvarying', 'VBG'),
 ('that', 'IN'),
 ('breaks', 'JJ'),
 ('things', 'NNS'),
 ('down', 'RP'),
 ('into', 'IN'),
 ('a', 'DT'),
 ('single', 'JJ'),
 ('thing', 'NN'),
 ('colorless', 'NN'),
 ('formless', 'NN'),
 ('Almost', 'NNP'),
 ('into', 'IN'),
 ('a', 'DT'),
 ('thought', 'VBN'),
 ('The', 'DT'),
 ('elemental', 'JJ'),
 ('vast', 'JJ'),
 ('night', 'NN'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('day', 'NN'),
 ('teeming', 'VBG'),
 ('with', 'IN'),
 ('people', 'NNS'),
 ('have', 'VBP'),
 ('become', 'VBN'),
 ('that', 'IN'),
 ('fog', 'NN'),
 ('of', 'IN'),
 ('constant', 'JJ'),
 ('tentative', 'JJ'),
 ('light', 'NN'),
 ('that', 'WDT'),
 ('does', 'VBZ'),
 ('not', 'RB'),
 ('flag', 'VB'),
 ('and', 'CC'),
 ('lies', 'VBZ'),
 ('in', 'IN'),
 ('wait', 'NN'),
 ('

What might you conclude about Borges' style of writing based on the frequencies of non-stop words and stemmed words? 

# Why is preprocessing important?

Text preprocessing is an essential first step to coding and understanding machine learning algorithms. For machine learning portions of this course, we will focus on bag of words models, namely document-term and term frequency-inverse document frequency models from the [sklearn library](http://scikit-learn.org/stable/). 
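
To give a sense of what a bag of words model looks like, here is a hand-built sketch of a tiny document-term matrix using only `Counter`. It does not use sklearn's API; the two mini-documents and the names `docs`, `counts`, `vocab`, and `dtm` are assumptions made up for illustration.

```python
from collections import Counter

# Two hypothetical mini-documents, already lower-cased and punctuation-free
docs = ["the fog and the night", "the night and the day"]

counts = [Counter(doc.split()) for doc in docs]          # word counts per document
vocab = sorted(set(word for c in counts for word in c))  # shared vocabulary

# one row per document, one column per vocabulary word
dtm = [[c[word] for word in vocab] for c in counts]

print(vocab)  # ['and', 'day', 'fog', 'night', 'the']
print(dtm)    # [[1, 0, 1, 1, 2], [1, 1, 0, 1, 2]]
```

Each row ignores word order entirely and keeps only the counts, which is why clean, consistently preprocessed tokens matter so much for these models.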

As previously stated, these instructions can be improved upon using regular expressions. 

# Challenge 4

We can also open data from files. Let's open up the "poe.txt" file from the materials you downloaded earlier. This contains the poem "A Dream Within a Dream" by Edgar Allan Poe.

Repeat the instructions in this notebook using Poe's poem. 

In [107]:
with open("./poe.txt", "r") as myfile:
    poe = myfile.read()

In [108]:
print(poe)

Take this kiss upon the brow!
And, in parting from you now,
Thus much let me avow —
You are not wrong, who deem
That my days have been a dream;
Yet if hope has flown away
In a night, or in a day,
In a vision, or in none,
Is it therefore the less gone?  
All that we see or seem
Is but a dream within a dream.

I stand amid the roar
Of a surf-tormented shore,
And I hold within my hand
Grains of the golden sand —
How few! yet how they creep
Through my fingers to the deep,
While I weep — while I weep!
O God! Can I not grasp 
Them with a tighter clasp?
O God! can I not save
One from the pitiless wave?
Is all that we see or seem
But a dream within a dream?



In [109]:
len(poe.split("\n"))

26

In [None]:
## YOUR CODE HERE