# Humanities data: what is it really?

Humanities work can take many forms, but it usually involves text.  This text may be:

* derivatives from text, e.g. bag of words
* transcriptions of interviews
* unstructured free text
* semi-structured data
* qualitative coded values
* lists of resources
* and many more

The data area where Python really shines is the manipulation of text structures.

## But what about R?

R is another language, focused on statistics.  It has great support for modeling, visualization, etc.  Many text modeling/analysis tools are written in R, so you may need to focus your efforts on learning R.  That said, the basic concepts that we've covered this morning and will look at this afternoon will carry over to working in R.  The structures in R are a bit different, so it isn't 1:1, but knowing the concept of a variable, the concept of a string, the concept of numerical types, etc. will serve you very well in the transition.

## You don't have to choose just one platform

Choosing the right tool for the right job is a key concept for programming.  Sometimes an entirely different language can save you a lot of trouble.  Also, the digital humanities crowd is pretty split between R and Python, so being functional in both will serve you well.

Python handles un-/semi-structured data very well, so you might find it easier to wrangle all your raw data in Python, then produce your analysis data file for processing in another program (e.g. produce coordinates for ArcGIS, network data for Gephi, categorical codes for R, etc.)

You don't need to pick and work with only one program.  You'll want to use many tools.


# Breaking down a problem

I want to focus more on design and give you some of the essential patterns to experiment with.  This is the best way to learn.  From a practitioner standpoint, starting with the core patterns and tinkering with them allows you to get things done first, and then you'll fill in the theory along the way.

### A basic problem:  Count the words in the Raven

In this problem we'll explore reading in text, manipulating it to get our data out, and then writing out data for ourselves.  Writing programs is a very iterative and non-linear process.  Much like starting off with an outline of a paper, you'll have clearer ideas on how to handle certain tasks than others.

Even though you've never seen this code, we can start somewhere!  We know all these things need to happen, even if we aren't sure how yet.

1. Read in the text
2. Count the words. I might need to toss out some stop words, but I don't know what those are yet.
3. Assemble some data
4. Make a data file where I have:  word, count

This is a great start, and pinpoints many of the problems that we need to solve.  I usually like to start filling in the easy chunks first, so I can focus on adapting the middle chunk.  Usually the juicy pieces, e.g. "counting the words", appear as just a few lines in the middle of a whole load of prep.

# Reading in text

There are multiple methods of reading in text, but I'm going to show you one of the classic methods.  We're very used to programs handling file opening, writing, editing, etc. for us, but when it comes to programming languages, we have to do that all ourselves.  We're used to opening and closing files within a GUI, but we'll be doing that more explicitly inside of our code.

`variable = open(filename, mode)` is our basic pattern for file handling.  The filename is just a string of the file path, and the mode is a string declaring which mode you'd like to open the file in (see https://docs.python.org/3.6/library/functions.html#open).  There are many, but you're usually going to use:

* 'r' for reading in a file
* 'w' for writing to a file
* 'a' for append (not the most common)

Once you're done doing what you need to a file, you'll want to run:  `variable.close()`.  Once you close a file you will no longer have read or write access to that IO object.

The more specific pattern for handling files is:

1. Create a file IO object in the desired mode
    * `file_in = open('raven.txt', 'r')`
    * `file_out = open('raven_data.txt', 'w')`
    * When writing out a file, if that file name does not exist it will create it.  If it does exist, in `'w'` mode it'll overwrite it, and `'a'` mode will add stuff to the end of it. 
2. Do your stuff to the file
    * We haven't learned these yet!
3. Close the file
    * `file_in.close()`
    * `file_out.close()`
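
The open / do stuff / close pattern above can also be written with a `with` block, which closes the file for you when the indented block ends.  Here is a minimal sketch of all three modes.  The scratch file `demo.txt` and the variable names are made up for illustration so the example runs without `raven.txt`.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary folder (not part of the lesson data).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# 'w' mode creates the file, or overwrites it if it already exists.
with open(demo_path, 'w') as file_out:
    file_out.write('Once upon a midnight dreary\n')

# 'a' mode adds to the end of the file instead of overwriting it.
with open(demo_path, 'a') as file_out:
    file_out.write('while I pondered, weak and weary\n')

# 'r' mode reads the file; it is closed automatically after the block.
with open(demo_path, 'r') as file_in:
    demo_text = file_in.read()

print(demo_text)
print(file_in.closed)  # the IO object still exists, but it is closed
```

Either style is fine.  The explicit `open()` and `.close()` calls make each step visible, while `with` means you can't forget the close.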

## .read() the oldie but goodie

There are many methods of reading in a file, but .read() is sort of the "when unsure..." option.

In [4]:
file_in = open('raven.txt', 'r') # create IO object in read mode
text = file_in.read() # apply .read() to the IO object and store the results as variable text
file_in.close()

In [2]:
print(text)

Once upon a midnight dreary, while I pondered, weak and weary, 
Over many a quaint and curious volume of forgotten lore, 
While I nodded, nearly napping, suddenly there came a tapping, 
As of some one gently rapping, rapping at my chamber door. 
"'Tis some visitor," I muttered, "tapping at my chamber door- 
                Only this, and nothing more." 

Ah, distinctly I remember it was in the bleak December, 
And each separate dying ember wrought its ghost upon the floor. 
Eagerly I wished the morrow;- vainly I had sought to borrow 
From my books surcease of sorrow- sorrow for the lost Lenore- 
For the rare and radiant maiden whom the angels name Lenore- 
                Nameless here for evermore. 

And the silken, sad, uncertain rustling of each purple curtain 
Thrilled me- filled me with fantastic terrors never felt before; 
So that now, to still the beating of my heart, I stood repeating, 
"'Tis some visitor entreating entrance at my chamber door- 
Some late visitor entreating ent

In [3]:
text # running it like this better shows the raw content of the file.

'Once upon a midnight dreary, while I pondered, weak and weary, \nOver many a quaint and curious volume of forgotten lore, \nWhile I nodded, nearly napping, suddenly there came a tapping, \nAs of some one gently rapping, rapping at my chamber door. \n"\'Tis some visitor," I muttered, "tapping at my chamber door- \n                Only this, and nothing more." \n\nAh, distinctly I remember it was in the bleak December, \nAnd each separate dying ember wrought its ghost upon the floor. \nEagerly I wished the morrow;- vainly I had sought to borrow \nFrom my books surcease of sorrow- sorrow for the lost Lenore- \nFor the rare and radiant maiden whom the angels name Lenore- \n                Nameless here for evermore. \n\nAnd the silken, sad, uncertain rustling of each purple curtain \nThrilled me- filled me with fantastic terrors never felt before; \nSo that now, to still the beating of my heart, I stood repeating, \n"\'Tis some visitor entreating entrance at my chamber door- \nSome late v

## .readlines() for more of a delicate touch

The other main method of reading in text is .readlines(), which does a lot of stuff for you.  But you do need to use it more purposefully.  We know that the text is organized by lines, and there are newlines aplenty.  .readlines() will create a list where each line of text is a separate string within that list.

So a text that looks like:

```
I am a meat popsicle.
I live to be eaten.
My soul is frozen.
There is a stick.
```

Will be turned into:

```
['I am a meat popsicle.\n',
 'I live to be eaten.\n',
 'My soul is frozen.\n',
 'There is a stick.\n']
```

Sometimes you want this structure because your plan is to do stuff line by line.  But sometimes you just want access to the entire document in a single flat structure.
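
To see the difference between the two structures side by side, here is a small sketch.  It uses `io.StringIO` as a stand-in for an opened file (an assumption made so the example runs without a file on disk); a real file IO object from `open()` offers the same `.read()` and `.readlines()` methods.

```python
import io

# Sample text standing in for the contents of a file.
popsicle = ("I am a meat popsicle.\n"
            "I live to be eaten.\n"
            "My soul is frozen.\n"
            "There is a stick.\n")

# .readlines() gives one flat list, with one string per line.
lines = io.StringIO(popsicle).readlines()

# .read() gives the entire content as a single string.
whole = io.StringIO(popsicle).read()

print(type(lines), len(lines))
print(lines[0])           # the newline is still attached to the line
print(lines[0].strip())   # .strip() removes surrounding white space
print(type(whole))
```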

## There is no one perfect way to read in a file

There is no one perfect method of reading in a file, except for the method that is right for your project.  You'll also develop your own personal style over time.  Which is fine!  Program development is an iterative process, so you may start down one path and then have to backtrack and pivot to another method.  Eventually you'll get a feel for which method is better for certain processes.

The goal is to have the content in the structure that you want.  How you accomplish that isn't as important as the accuracy of the end product.  Focus on that rather than number of lines of code, etc.  

## But which one do we want in this case?


Let's go back to our original goal: counting the words.  Our first question is, what is the unit being measured?  In this case, we want the counts for the whole document.  Every word treated equally.  We aren't splitting it apart, and we don't care about lines.  We just want a bag of words for the entire thing.

That's a pretty easy case for `.read()` since the newline breaks are a little meaningless.  We've already got that code above, so we'll copy it down here again.

In [4]:
file_in = open('raven.txt', 'r') # create IO object in read mode
text = file_in.read() # apply .read() to the IO object and store the results as variable text
file_in.close()

In [5]:
# for the sake of brevity, we'll be printing out the first 200 characters of the file
# rather than the entire thing
print(text[:200])

Once upon a midnight dreary, while I pondered, weak and weary, 
Over many a quaint and curious volume of forgotten lore, 
While I nodded, nearly napping, suddenly there came a tapping, 
As of some one


# Essential string processing

When it comes to counting things in text, we need to remember that Python doesn't entirely know what a word is.  Nor does it care.  This `text` object that we have stored in Python is a series of characters.  This means we'll need to describe what a word is on a character level.  We won't dive super deep into this question because that isn't the point of this lesson, but there are a few things we can do to explore the application of some of our theoretical structures.

As a first pass, let's say that "words are things separated by white space".  Which we know isn't the complete story, but it is sufficiently true that it is a good place to start.

## `.split()`

This function is a real work horse for text processing.  Say you have a long string with regular delimiters and you want to break it apart into those components.  `.split()` will do just that.  

* It operates on:  a string you want to be broken up
* You give it:  a string that is the delimiter
* It gives you:  a list, with each element being the chunk of characters between each of those delimiters.  The delimiter you pass it will be 'consumed', meaning that it will be erased from the content returned back to you.
* Pattern:  `new_text = original_text.split(my_delimiter)`  This method will not change the original string and returns a new list to you.  Meaning that you'll have to save that list value in a variable in order to keep it around.
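
Here is a minimal sketch of that pattern on a short string with a regular delimiter.  The `record` string is a made-up example, not part of the lesson data.

```python
# A made-up, comma-delimited string to break apart.
record = "Poe,Edgar Allan,The Raven,1845"

# Split on the comma; the commas are consumed and do not appear in the result.
fields = record.split(",")

print(fields)        # a list of the chunks between the commas
print(len(fields))
print(record)        # the original string is unchanged
```

Note that the space inside `'Edgar Allan'` survives, because only the delimiter we passed in is used to break up the string.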

In [6]:
maybe_words = text.split(" ") # see that single space in the quotes?

# let's investigate this object before we look at it

print(len(maybe_words)) # so there are a bunch, this seems in the right area
print(type(maybe_words)) # this is a list, so I need to manipulate it like one

1378
<class 'list'>


# Activity!  

Pair up and discuss what you're seeing inside of `maybe_words[:100]`.

1. What content do you see in the strings?
2. What words?
3. Punctuation?
4. Other observations?

In [7]:
# given that there are 1000+ words, let's just look at a slice

print(maybe_words[:100])

['Once', 'upon', 'a', 'midnight', 'dreary,', 'while', 'I', 'pondered,', 'weak', 'and', 'weary,', '\nOver', 'many', 'a', 'quaint', 'and', 'curious', 'volume', 'of', 'forgotten', 'lore,', '\nWhile', 'I', 'nodded,', 'nearly', 'napping,', 'suddenly', 'there', 'came', 'a', 'tapping,', '\nAs', 'of', 'some', 'one', 'gently', 'rapping,', 'rapping', 'at', 'my', 'chamber', 'door.', '\n"\'Tis', 'some', 'visitor,"', 'I', 'muttered,', '"tapping', 'at', 'my', 'chamber', 'door-', '\n', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Only', 'this,', 'and', 'nothing', 'more."', '\n\nAh,', 'distinctly', 'I', 'remember', 'it', 'was', 'in', 'the', 'bleak', 'December,', '\nAnd', 'each', 'separate', 'dying', 'ember', 'wrought', 'its', 'ghost', 'upon', 'the', 'floor.', '\nEagerly', 'I', 'wished', 'the', 'morrow;-', 'vainly']


When we take a closer look at what this has done, it's chosen to separate out long stretches of white space into a group of empty strings.  It wants to find the content between delimiters, but what happens when there's no content?  You get an empty string.  This is actually beneficial in other contexts, if super annoying in this one.

So you can see that it turns a string of 4 spaces into a list of 5 empty strings.  You'll also see that newline characters were retained and attached to the subsequent words.

In [9]:
print("    ".split(" "))

['', '', '', '', '']


The other way to use `.split()` is by giving it no arguments.  By default, it'll split on any amount of white space.  It'll also treat repeating instances of white space characters as a single delimiter.

In [8]:
print("    ".split()) # so now it's empty

[]


In [12]:
maybe_words = text.split() # not passing it anything!

# let's investigate this object before we look at it

print(len(maybe_words)) # fewer words this time!
print(type(maybe_words)) # still a list

1089
<class 'list'>


In [13]:
print(maybe_words[:100])

['Once', 'upon', 'a', 'midnight', 'dreary,', 'while', 'I', 'pondered,', 'weak', 'and', 'weary,', 'Over', 'many', 'a', 'quaint', 'and', 'curious', 'volume', 'of', 'forgotten', 'lore,', 'While', 'I', 'nodded,', 'nearly', 'napping,', 'suddenly', 'there', 'came', 'a', 'tapping,', 'As', 'of', 'some', 'one', 'gently', 'rapping,', 'rapping', 'at', 'my', 'chamber', 'door.', '"\'Tis', 'some', 'visitor,"', 'I', 'muttered,', '"tapping', 'at', 'my', 'chamber', 'door-', 'Only', 'this,', 'and', 'nothing', 'more."', 'Ah,', 'distinctly', 'I', 'remember', 'it', 'was', 'in', 'the', 'bleak', 'December,', 'And', 'each', 'separate', 'dying', 'ember', 'wrought', 'its', 'ghost', 'upon', 'the', 'floor.', 'Eagerly', 'I', 'wished', 'the', 'morrow;-', 'vainly', 'I', 'had', 'sought', 'to', 'borrow', 'From', 'my', 'books', 'surcease', 'of', 'sorrow-', 'sorrow', 'for', 'the', 'lost', 'Lenore-']


Now this is looking a lot more reasonable!  So let's play with a neat tool, the `Counter()`.

In [17]:
from collections import Counter

counted_words = Counter(maybe_words)

#hmmm, what is this
print(len(counted_words)) # so there are fewer items in here
print(type(counted_words)) # not a list?

560
<class 'collections.Counter'>


In [18]:
print(counted_words[:50])

TypeError: unhashable type: 'slice'

This notation doesn't work because we're not dealing with a list anymore!  We're dealing with a new object type, a Counter object.  Under the hood, it looks a lot like a dictionary.  Items in this object are looked up by key rather than by index position, so slicing doesn't apply.  When we print the entire thing out:

`print(counted_words)` 

We can see that it starts with `Counter({'the': 56, 'and': 30, 'I': 27,`...  These are key:value pairs.  So `'the'` is a key, and that `56` is the value.  In this case, all the unique words in the list appear as the keys, and the number of times they appear are the values.  
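
As a small sketch of how those key:value pairs behave, here is a `Counter` built from a short made-up list (`sample_words` is illustrative, not the poem).  You look up a count by its key, the same way you would with a dictionary.

```python
from collections import Counter

# A short, made-up list of words to count.
sample_words = ['the', 'raven', 'the', 'door', 'the', 'raven']
sample_counts = Counter(sample_words)

print(sample_counts['the'])          # look up a count by its key
print(sample_counts['lenore'])       # a word that never appeared counts as 0
print(list(sample_counts.keys()))    # the unique words
print(sample_counts.most_common(2))  # the two most frequent (word, count) pairs
```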

Don't believe me? We can check this.

In [20]:
print(len(maybe_words))
print(sum(counted_words.values()))

1089
1089


`Counter` objects have all kinds of fun data exploration methods.  Most importantly, we can look at the most common.

# Activity!

Work with your partner to discuss the results of `print(counted_words.most_common(20))`

1.  What are your observations of the counts?
2.  What's happening with the words that you see?
3.  Is this analysis done?  Can we do more to refine our text for better results?

In [24]:
print(counted_words.most_common(20))

[('the', 56), ('and', 30), ('I', 27), ('my', 24), ('of', 20), ('that', 15), ('a', 15), ('chamber', 11), ('this', 11), ('"Nevermore."', 8), ('And', 8), ('at', 8), ('bird', 8), ('is', 8), ('from', 7), ('with', 7), ('in', 7), ('no', 7), ('above', 7), ('or', 7)]


These are the 20 most common words in the document, and all that counting has been done for us.  While cool, we can already see some limitations to our methods.

* `'and'` and `'And'` have been counted as separate words
* `'"Nevermore."'` has quotation marks and a period attached
* very few of these words are interesting

We can tackle all three of these issues.

## Lowercase things for better counting

Recall our discussion of that ASCII table and that `A` and `a` are totally separate letters in the eyes of Python.  Sometimes we care about this, such as an instance of someone named Rose versus someone receiving a rose.  But more often than not, we need to lowercase everything to get better counts.  You could uppercase everything to standardize stuff for counting...but pain.  

## `.lower()`

We can use the `.lower()` method to lowercase every character.

* It operates on:  a string you want to lowercase
* You give it:  usually nothing, so leave the `()` empty
* It gives you:  A new string, with every character that can be lowercased converted to lowercase. This method will not change the original string and returns a new string to you.  Meaning that you'll have to save that string value in a variable in order to keep it around.
* Basic pattern:  `lower_string = original_string.lower()`

Let's start from the top.

In [35]:
file_in = open('raven.txt', 'r') # create IO object in read mode
text = file_in.read() # apply .read() to the IO object and store the results as variable text
file_in.close()

lower_text = text.lower() # our new addition!

maybe_words = lower_text.split()
counted_words = Counter(maybe_words)

print(counted_words.most_common(20))

[('the', 56), ('and', 38), ('i', 27), ('my', 24), ('of', 21), ('that', 17), ('this', 15), ('a', 15), ('chamber', 11), ('on', 10), ('bird', 9), ('is', 9), ('from', 8), ('with', 8), ('at', 8), ('in', 8), ('"nevermore."', 8), ('by', 7), ('above', 7), ('then', 7)]


We can see that the `('and', 30)` and `('And', 8)` have been combined into `('and', 38)`.  

But we've still got that punctuation thing going.  We can see this is a problem with `('"nevermore."', 8)`. I can show you a bit more advanced work to highlight that we are actually losing data with this punctuation thing.

In [36]:
for word in maybe_words:
    if 'nevermore' in word:
        print(word)

"nevermore."
"nevermore."
"nevermore."
nevermore'."
"nevermore."
nevermore!
"nevermore."
"nevermore."
"nevermore."
"nevermore."
nevermore!


This checks for any unbroken instance of 'nevermore' inside of a string and prints it out if it is found.  We can see that there are several versions of this.  There are 11 results here with three different kinds of punctuation issues.

## Getting rid of punctuation

This is something a little more involved, so hang on with me.

It would be great to be able to say "replace all these things with nothing", but when you have multiple things to replace with one thing...things get weird.  There are many ways to solve this problem, this is just one.

What we're going to do is take advantage of the regular expression module.  That is a whole other conversation, but regular expressions are a way to describe text patterns.  We create a character class that has all the desired punctuation characters in it, and tell it to replace any match with an empty string (effectively, nothing).

Character classes in regular expressions are denoted by square brackets.  For example `[abc]` would match `a`, or `b`, or `c`.  We'll need to first import the re module, which contains all the regular expression tools for Python, and then use the `re.sub()` function.

`re.sub()` works a little differently from the other methods that we've explored.  Instead of it being called on the string you want to change, you call the function and pass it three arguments:  the pattern for replacement (in our case the character class), the replacement value (an empty string for this), and the string to be changed.

This is also one of our first instances of calling a function from another module.  We sort of did this with Counter, but when we imported it directly from the module we don't need to use the module name.  The more generic `import module_name` will require that we put `module_name` before the function name.

We've got a variable with a string of the letters to erase, and need to transform that into a character class.  Remember that character classes are a series of characters inside square brackets.  So `'[' + string + ']'` should do the trick.

In [30]:
import re

original = "Hello, human meat creature"

letters_to_erase= 'aeiou'

print(re.sub("[" + letters_to_erase + "]", '', original))

Hll, hmn mt crtr


We are now capable of taking a string, making a character class for items to remove, and then running that through `re.sub()` to create a new string without those characters.

The only thing missing is how to get all of the punctuation characters.  The handy `string` module is our friend in this case.  It has a series of constants that hold a variety of ASCII character sequences (letters, digits, punctuation).  This may not save characters in our code, but makes our intent clearer by having dedicated names.

We'll need to import string like we did with re, meaning that we need to add `string.` before all our calls.  You can read all about the other items in the string module here: https://docs.python.org/3.6/library/string.html

Special note: while we've gotten used to having `()` on the end of our function calls, this is a little different.  Internally, these aren't actually functions, they're variables inside the module that we're getting values for.  Anyhow, you don't need to care why, just remember that you omit the `()` when trying to access the string values from the string module.

In [31]:
import string

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


That's a lot easier than trying to type that a whole bunch of times.  
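
One thing to be aware of: several of those punctuation characters (such as `]`, `\`, `^`, and `-`) have special meanings inside a character class.  Wrapping `string.punctuation` in plain square brackets happens to work, but a more defensive variation is to run it through `re.escape()` first, which marks every special character as literal.  This is a sketch of that alternative, not a required change; the `sample` line and variable names are just for illustration.

```python
import re
import string

# re.escape() backslash-escapes characters that are special in a pattern,
# so each punctuation mark is matched literally.
punctuation_class = '[' + re.escape(string.punctuation) + ']'

sample = '"\'Tis some visitor," I muttered, "tapping at my chamber door-'

cleaned = re.sub(punctuation_class, '', sample)
print(cleaned)

# For comparison: the unescaped character class gives the same result here.
unescaped = re.sub('[' + string.punctuation + ']', '', sample)
print(cleaned == unescaped)
```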

Let's put all this together now.  Recall that our `lower_text` variable has our lowercased text.

In [32]:
import re
import string

file_in = open('raven.txt', 'r') # create IO object in read mode
text = file_in.read() # apply .read() to the IO object and store the results as variable text
file_in.close()

lower_text = text.lower() # our new addition!

print(re.sub('[' + string.punctuation + ']', '', lower_text))

once upon a midnight dreary while i pondered weak and weary 
over many a quaint and curious volume of forgotten lore 
while i nodded nearly napping suddenly there came a tapping 
as of some one gently rapping rapping at my chamber door 
tis some visitor i muttered tapping at my chamber door 
                only this and nothing more 

ah distinctly i remember it was in the bleak december 
and each separate dying ember wrought its ghost upon the floor 
eagerly i wished the morrow vainly i had sought to borrow 
from my books surcease of sorrow sorrow for the lost lenore 
for the rare and radiant maiden whom the angels name lenore 
                nameless here for evermore 

and the silken sad uncertain rustling of each purple curtain 
thrilled me filled me with fantastic terrors never felt before 
so that now to still the beating of my heart i stood repeating 
tis some visitor entreating entrance at my chamber door 
some late visitor entreating entrance at my chamber door 
            

In [41]:
# and now we recount things using this cleaned text

import re
import string
from collections import Counter

file_in = open('raven.txt', 'r') # create IO object in read mode
text = file_in.read() # apply .read() to the IO object and store the results as variable text
file_in.close()

lower_text = text.lower() # our new addition!

no_punc_lower_text = re.sub('[' + string.punctuation + ']', '', lower_text)

maybe_words = no_punc_lower_text.split()

counted_words = Counter(maybe_words)

print(counted_words.most_common(15))

[('the', 56), ('and', 38), ('i', 32), ('my', 24), ('of', 21), ('this', 17), ('that', 17), ('a', 15), ('door', 14), ('chamber', 11), ('nevermore', 11), ('is', 11), ('bird', 10), ('on', 10), ('raven', 10)]


This isn't too terribly different from before, but we can see that ` ('"nevermore."', 8)` has actually changed to `('nevermore', 11)`.  So there must have been other variations of the word floating around our original list.

## Excluding certain words

This will not be a complete discussion of stop words, why you want to exclude them, and what's happening here.  We can see that many of our most common words are common words required to construct syntactically correct English, rather than words that hold interesting meaning within the poem.  We'd like to remove them before counting.

This will have us explore a basic accumulator pattern.  Meaning that we will traverse over our list of words, check if they do or don't belong, and then gather up all the words that we want.  Accumulator patterns are classic and prevalent in nearly every coding pattern when dealing with data.

Basic accumulator pattern:

```python
empty_bucket = [] # an empty list to gather our words

for word in list_of_words: # loop over your list of words
    # do something to either change the word or test something about it
    empty_bucket.append(word) # add the word to our accumulator variable
```

Our first step will be to make the list of words we don't want, which we can do by looking at the top of our list of counted words and picking out some that don't interest us.


In [42]:
words_to_omit = ['the', 'and', 'i', 'my', 'of', 'this', 'that', 'a', 'is', 'on']

# we can leave them lower case because we're only going to be
# traversing over our list of lowercase words
# these are from my top 15 words, so let's see what happens

words_i_want = []

for word in maybe_words:
    if word in words_to_omit:
        continue
    else:
        words_i_want.append(word)

We're using this `in` keyword again.  We previously used it to check if there's an unbroken instance of 'nevermore' in a larger string.  Now we're using it to check membership of an item in a list.

We've also added in an `else:` statement, which pairs up with that `if`.  It will execute in conditions where the if statement is false.
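
The same membership check can be flipped around with `not in`, which removes the need for `continue` and `else`.  Below is a sketch of that variation on a short made-up list.  The names `sample_words`, `sample_omit`, and `sample_kept` are illustrative, and storing the words to omit in a set (curly braces) rather than a list is an optional choice that keeps the check quick as the collection grows.

```python
# A short, made-up list of already-lowercased words.
sample_words = ['once', 'upon', 'a', 'midnight', 'dreary', 'while', 'i', 'pondered']

# A set of words to leave out; `in` and `not in` work on sets just like lists.
sample_omit = {'a', 'i', 'the'}

sample_kept = []  # our accumulator

for word in sample_words:
    if word not in sample_omit:   # only keep words that are NOT in the omit set
        sample_kept.append(word)

print(sample_kept)
```

Both versions gather the same words; pick whichever reads more clearly to you.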

But let's view our newly accumulated list of words.

In [43]:
print(words_i_want)

['once', 'upon', 'midnight', 'dreary', 'while', 'pondered', 'weak', 'weary', 'over', 'many', 'quaint', 'curious', 'volume', 'forgotten', 'lore', 'while', 'nodded', 'nearly', 'napping', 'suddenly', 'there', 'came', 'tapping', 'as', 'some', 'one', 'gently', 'rapping', 'rapping', 'at', 'chamber', 'door', 'tis', 'some', 'visitor', 'muttered', 'tapping', 'at', 'chamber', 'door', 'only', 'nothing', 'more', 'ah', 'distinctly', 'remember', 'it', 'was', 'in', 'bleak', 'december', 'each', 'separate', 'dying', 'ember', 'wrought', 'its', 'ghost', 'upon', 'floor', 'eagerly', 'wished', 'morrow', 'vainly', 'had', 'sought', 'to', 'borrow', 'from', 'books', 'surcease', 'sorrow', 'sorrow', 'for', 'lost', 'lenore', 'for', 'rare', 'radiant', 'maiden', 'whom', 'angels', 'name', 'lenore', 'nameless', 'here', 'for', 'evermore', 'silken', 'sad', 'uncertain', 'rustling', 'each', 'purple', 'curtain', 'thrilled', 'me', 'filled', 'me', 'with', 'fantastic', 'terrors', 'never', 'felt', 'before', 'so', 'now', 'to', 

In [44]:
# and then add this to our larger script

import re
import string
from collections import Counter

file_in = open('raven.txt', 'r') # create IO object in read mode
text = file_in.read() # apply .read() to the IO object and store the results as variable text
file_in.close()

lower_text = text.lower() # our new addition!

no_punc_lower_text = re.sub('[' + string.punctuation + ']', '', lower_text)

maybe_words = no_punc_lower_text.split()

words_to_omit = ['the', 'and', 'i', 'my', 'of', 'this', 'that', 'a', 'is', 'on']

words_i_want = []

for word in maybe_words:
    if word in words_to_omit:
        continue
    else:
        words_i_want.append(word)

counted_words = Counter(words_i_want) # I want to count this new list instead

print(counted_words.most_common(15))

[('door', 14), ('chamber', 11), ('nevermore', 11), ('bird', 10), ('raven', 10), ('me', 9), ('from', 8), ('more', 8), ('then', 8), ('with', 8), ('at', 8), ('in', 8), ('or', 8), ('lenore', 8), ('thy', 7)]


# Activity!

Work with your partner to expand the list of words to omit and explore the results on the data being generated.  Take about 10 minutes to explore this and we'll discuss which words you chose and why.

In [1]:
import re
import string
from collections import Counter

file_in = open('raven.txt', 'r') # create IO object in read mode
text = file_in.read() # apply .read() to the IO object and store the results as variable text
file_in.close()

lower_text = text.lower() # our new addition!

no_punc_lower_text = re.sub('[' + string.punctuation + ']', '', lower_text)

maybe_words = no_punc_lower_text.split()

words_to_omit = ['the', 'and', 'i', 'my', 'of', 'this', 'that', 'a', 'is',
                 'on', 'me', 'from', 'then', 'with', 'at', 'in', 'or', 'by', 'he', 'but',
                 'there', 'his', 'more', 'thy']

words_i_want = []

for word in maybe_words:
    if word in words_to_omit:
        continue
    else:
        words_i_want.append(word)

counted_words = Counter(words_i_want) # I want to count this new list instead

print(counted_words.most_common(15))

[('door', 14), ('nevermore', 11), ('chamber', 11), ('raven', 10), ('bird', 10), ('lenore', 8), ('said', 7), ('still', 7), ('nothing', 7), ('no', 7), ('above', 7), ('bust', 6), ('was', 6), ('to', 6), ('word', 6)]


# And we're done...

Hopefully now you're more comfortable with working inside of a scripting system, perhaps even working inside of Jupyter Notebooks.  If you're working in the notebooks, you won't need to repeat as much code as we did here, but if you're working inside of a script you'll be editing the same document.  

The process of developing a script will happen alongside the process of developing your analyses.  Doing this inside of the Jupyter notebook, as we've done, will help you narrate the choices and presumptions you've made.

What this hasn't covered is how to write out these results, but that's another conversation.  

# Bonus template: writing out the data

In [5]:
import csv

print_me = []

for word, count in counted_words.items():
#     print(word, count)
    print_me.append([word, count])
    
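# newline='' is recommended by the csv module so rows aren't double-spaced on some systems
with open('word_counts.csv', 'w', newline='') as file_out: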
    writeme = csv.writer(file_out)
    writeme.writerow(['word', 'count'])
    writeme.writerows(print_me)
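
A possible variation on the template above: `.most_common()` with no argument returns every (word, count) pair sorted from most to least frequent, so it can be handed straight to `writerows()`.  The sketch below uses a small made-up `Counter` and a temporary folder (both assumptions for illustration) and reads the file back in to check what was written.

```python
import csv
import os
import tempfile
from collections import Counter

# A small, made-up Counter standing in for counted_words.
sample_counts = Counter(['door', 'raven', 'door', 'lenore', 'door', 'raven'])

# Hypothetical output location in a temporary folder.
out_path = os.path.join(tempfile.mkdtemp(), 'word_counts_sorted.csv')

with open(out_path, 'w', newline='') as file_out:
    writer = csv.writer(file_out)
    writer.writerow(['word', 'count'])               # header row
    writer.writerows(sample_counts.most_common())    # rows sorted by count

# Read the file back in to confirm what was written.
with open(out_path, 'r', newline='') as file_in:
    rows = list(csv.reader(file_in))

print(rows)
```

Notice that the counts come back as strings; a CSV file stores everything as text, so you would convert them with `int()` if you need numbers again.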