# Humanities data: what is it really?

Humanities work can take many forms, but it usually involves text.  This text may be:

* derivatives from text, e.g. bag of words
* transcriptions of interviews
* unstructured free text
* semi-structured data
* qualitative coded values
* lists of resources
* and many more

The data area where Python really shines is the manipulation of text structures.

## But what about R?

R is another language, focused on statistics.  It has great support for modeling, visualization, and more.  Many text modeling and analysis tools are written in R, so you may need to put some of your effort into learning R.  That said, the basic concepts that we've covered this morning and will look at this afternoon carry over to working in R.  The structures in R are a bit different, so it isn't 1:1, but knowing the concepts of a variable, a string, numerical types, etc. will serve you very well in the transition.

## You don't have to choose just one platform

Choosing the right tool for the job is a key concept in programming.  Sometimes an entirely different language can save you a lot of trouble.  Also, the digital humanities crowd is pretty split between R and Python, so being functional in both will serve you well.

Python handles un-/semi-structured data very well, so you might find it easier to wrangle all your raw data in Python, then produce your analysis data file for processing in another program (e.g. produce coordinates for ArcGIS, network data for Gephi, categorical codes for R, etc.)

You don't need to pick and work with only one program.  You'll want to use many tools.
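
As a rough illustration of that hand-off, here is a minimal sketch that writes a small table of categorical codes to a CSV file that a program like R could read.  The respondent IDs, the codes, and the file name `codes_for_r.csv` are all made up for the example, and the file goes into a temporary folder so nothing of yours gets overwritten.  It uses file handling that we cover later, so don't worry about the details yet.

```python
import csv
import os
import tempfile

# Hypothetical coded interview data: (respondent id, qualitative code)
coded_rows = [
    ("r01", "positive"),
    ("r02", "negative"),
    ("r03", "positive"),
]

# Write into a temporary folder so the sketch does not touch your real files
out_path = os.path.join(tempfile.mkdtemp(), "codes_for_r.csv")

file_out = open(out_path, "w", newline="")  # newline="" is what the csv module expects
writer = csv.writer(file_out)
writer.writerow(["respondent", "code"])  # header row
writer.writerows(coded_rows)             # one line per respondent
file_out.close()

# Read the file back to see what the other program would receive
file_in = open(out_path, "r")
csv_text = file_in.read()
file_in.close()
print(csv_text)
```

The point is only that Python does the wrangling and then steps aside; the analysis itself can happen wherever it is most convenient.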


# Breaking down a problem

I want to focus more on design and give you some of the essential patterns to experiment with.  This is the best way to learn.  From a practitioner's standpoint, starting with the core patterns and tinkering with them lets you get things done first and fill in the theory along the way.

### A basic problem:  Count the words in "The Raven"

In this problem we'll explore reading in text, manipulating it to get our data out, and then writing out data for ourselves.  Writing programs is a very iterative and non-linear process.  Much like starting off with an outline of a paper, you'll have clearer ideas on how to handle certain tasks than others.

Even though you've never seen this code, we can start somewhere!  We know all these things need to happen, even if we aren't sure how yet.

1. Read in the text
2. Count the words. I might need to toss out some stop words, but I don't know what those are yet.
3. Assemble some data
4. Make a data file where I have:  word, count

This is a great start, and pinpoints many of the problems that we need to solve.  I usually like to start filling in the easy chunks first, so I can focus on adapting the middle chunk.  Usually the juicy pieces, e.g. "counting the words", appear as just a few lines in the middle of a whole load of prep.
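
One way to carry the outline into code is to paste it in as comments and fill in whatever you already know how to do.  This is only a sketch: the short string stands in for the real text, and splitting on whitespace is a placeholder for step 2, not the finished word count.

```python
# 1. Read in the text
#    (a short stand-in string here; the real text will come from a file)
text = "Once upon a midnight dreary, while I pondered, weak and weary,"

# 2. Count the words
#    Placeholder: splitting on whitespace only tells us how many words there are,
#    and punctuation is still attached to some of them.
words = text.split()
print(len(words))

# 3. Assemble some data
#    TODO: one (word, count) pair per distinct word

# 4. Make a data file where I have:  word, count
#    TODO: write the pairs out to a file
```

The program already runs, even though most of it is still comments.  Each TODO is a spot to come back to once we know the pattern for it.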

# Reading in text

There are multiple methods of reading in text, but I'm going to show you one of the classic methods.  We're very used to programs handling file opening, writing, editing, etc. for us, usually through a GUI.  In a programming language we have to do all of that ourselves, explicitly, inside of our code.

`variable = open(filename, mode)` is our basic pattern for file handling.  The filename is just a string holding the file path, and the mode is a string declaring how you'd like to open the file (see https://docs.python.org/3.6/library/functions.html#open).  There are many modes, but you're usually going to use:

* 'r' for reading in a file
* 'w' for writing to a file
* 'a' for append (not the most common)

Once you're done with a file, you'll want to run:  `variable.close()`.  Once you close a file you will no longer have read or write access through that IO object.
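
Here is a small sketch of what "no longer have access" looks like in practice.  The file name `scratch.txt` and the temporary folder are assumptions made so the example runs anywhere.  Anything you already read into a variable stays available after closing; it is only the IO object that stops working.

```python
import os
import tempfile

# Make a small throwaway file so the sketch runs anywhere
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")
file_out = open(path, "w")
file_out.write("Nevermore.\n")
file_out.close()

file_in = open(path, "r")
text = file_in.read()
file_in.close()

print(file_in.closed)  # the IO object reports whether it has been closed
print(text)            # the string we already read is still here

# Asking the closed IO object for more raises a ValueError
try:
    file_in.read()
    read_failed = False
except ValueError as err:
    read_failed = True
    print("Error:", err)
```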

The more specific pattern for handling files is:

1. Create a file IO object in the desired mode
    * `file_in = open('raven.txt', 'r')`
    * `file_out = open('raven_data.txt', 'w')`
    * When writing out a file, if that file name does not exist Python will create it.  If it does exist, `'w'` mode will overwrite it, and `'a'` mode will add to the end of it.
2. Do your stuff to the file
    * We haven't learned these yet!
3. Close the file
    * `file_in.close()`
    * `file_out.close()`
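
To see the difference between `'w'` and `'a'` for yourself, here is a minimal sketch of the open, do stuff, close pattern.  It assumes a temporary folder (my addition, so that no real file gets overwritten) and uses `.write()`, which we haven't covered yet; for now, read it as "put this string into the file".

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "raven_data.txt")

# 'w' creates the file because it does not exist yet
file_out = open(path, "w")
file_out.write("first\n")
file_out.close()

# 'w' again: the old contents are replaced
file_out = open(path, "w")
file_out.write("second\n")
file_out.close()

# 'a': the new text is added to the end
file_out = open(path, "a")
file_out.write("third\n")
file_out.close()

# Read the file back to see what survived
file_in = open(path, "r")
contents = file_in.read()
file_in.close()
print(contents)
```

The line "first" is gone because the second `'w'` overwrote it, while "third" was appended after "second".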

## .read() the oldie but goodie

There are many methods of reading in a file, but .read() is sort of the "when unsure..." option.

In [1]:
file_in = open('raven.txt', 'r') # create IO object in read mode
text = file_in.read() # apply .read() to the IO object and store the results as variable text
file_in.close()

In [2]:
print(text)

Once upon a midnight dreary, while I pondered, weak and weary, 
Over many a quaint and curious volume of forgotten lore, 
While I nodded, nearly napping, suddenly there came a tapping, 
As of some one gently rapping, rapping at my chamber door. 
"'Tis some visitor," I muttered, "tapping at my chamber door- 
                Only this, and nothing more." 

Ah, distinctly I remember it was in the bleak December, 
And each separate dying ember wrought its ghost upon the floor. 
Eagerly I wished the morrow;- vainly I had sought to borrow 
From my books surcease of sorrow- sorrow for the lost Lenore- 
For the rare and radiant maiden whom the angels name Lenore- 
                Nameless here for evermore. 

And the silken, sad, uncertain rustling of each purple curtain 
Thrilled me- filled me with fantastic terrors never felt before; 
So that now, to still the beating of my heart, I stood repeating, 
"'Tis some visitor entreating entrance at my chamber door- 
Some late visitor entreating ent

In [3]:
text # running it like this better shows the raw content of the file. 

'Once upon a midnight dreary, while I pondered, weak and weary, \nOver many a quaint and curious volume of forgotten lore, \nWhile I nodded, nearly napping, suddenly there came a tapping, \nAs of some one gently rapping, rapping at my chamber door. \n"\'Tis some visitor," I muttered, "tapping at my chamber door- \n                Only this, and nothing more." \n\nAh, distinctly I remember it was in the bleak December, \nAnd each separate dying ember wrought its ghost upon the floor. \nEagerly I wished the morrow;- vainly I had sought to borrow \nFrom my books surcease of sorrow- sorrow for the lost Lenore- \nFor the rare and radiant maiden whom the angels name Lenore- \n                Nameless here for evermore. \n\nAnd the silken, sad, uncertain rustling of each purple curtain \nThrilled me- filled me with fantastic terrors never felt before; \nSo that now, to still the beating of my heart, I stood repeating, \n"\'Tis some visitor entreating entrance at my chamber door- \nSome late v

## .readlines() for more of a delicate touch

The other main method of reading in text is .readlines(), which does a lot of stuff for you.  But you do need to use it more purposefully.  We know that the text is organized by lines, and there are newlines aplenty.  .readlines() will create a list where each line of text is a separate string within that list.

So a text that looks like:

```
I am a meat popsicle.
I live to be eaten.
My soul is frozen.
There is a stick.
```

Will be turned into:

```
['I am a meat popsicle.\n',
 'I live to be eaten.\n',
 'My soul is frozen.\n',
 'There is a stick.\n']
```

Sometimes you want this structure because your plan is to do stuff line by line.  But sometimes you just want access to the entire document in a single flat structure.
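
Here is a short sketch of both situations.  The file name `popsicle.txt` and the temporary folder are assumptions so the example is self-contained.  `.readlines()` gives back one list of strings, each string keeps its newline, and joining the list back together recovers the single flat string that `.read()` would have given you.

```python
import os
import tempfile

original = (
    "I am a meat popsicle.\n"
    "I live to be eaten.\n"
    "My soul is frozen.\n"
    "There is a stick.\n"
)

# Write the sample text to a throwaway file
path = os.path.join(tempfile.mkdtemp(), "popsicle.txt")
file_out = open(path, "w")
file_out.write(original)
file_out.close()

# Read it back as a list of lines
file_in = open(path, "r")
lines = file_in.readlines()
file_in.close()
print(lines)

# Line by line: strip the trailing newline from each string
stripped = []
for line in lines:
    stripped.append(line.rstrip("\n"))
print(stripped)

# Single flat structure: glue the lines back into one string
flat = "".join(lines)
print(flat == original)
```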