# Getting Texts

We will first experiment with loading a plain-text file into memory from [Project Gutenberg](http://gutenberg.org), an online library with tens of thousands of free texts in different languages and formats. Googling something like `python3 read from url` turns up pages such as https://docs.python.org/3/howto/urllib2.html that explain the basics of reading content from a URL.

In [2]:
import urllib.request
poeUrl = "http://www.gutenberg.org/files/2147/2147-0.txt"
poeString = urllib.request.urlopen(poeUrl).read().decode().strip()
print("This string has", len(poeString), "characters")

This string has 550321 characters
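The one-line download above relies on `decode()` defaulting to UTF-8 and sets no timeout. A minimal sketch of a reusable helper, assuming the file really is UTF-8 encoded; `fetch_text` and its parameters are hypothetical names, not part of the original notebook:

```python
import urllib.request

def fetch_text(url, encoding="utf-8", timeout=30):
    """Download url and return its body as a stripped string.

    Assumes the resource is text in the given encoding (UTF-8 by default).
    """
    # the context manager closes the connection even if reading fails
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw_bytes = response.read()
    return raw_bytes.decode(encoding).strip()

# usage (requires network access):
# poeString = fetch_text("http://www.gutenberg.org/files/2147/2147-0.txt")
```

Naming the encoding explicitly makes the assumption visible, and the timeout keeps a stalled download from hanging the notebook indefinitely.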


In [3]:
poeStringLen = len(poeString)
poeStringLenFormatted = "{:,}".format(poeStringLen) # format mini-language
print("This string has", poeStringLenFormatted, "characters")

This string has 550,321 characters
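The same format mini-language also works inside an f-string, which can be easier to read than a separate `.format()` call. A small sketch, using the character count reported above as a stand-in value (`charCount` is an illustrative name):

```python
charCount = 550321  # the length reported above

# "," in the format spec inserts a thousands separator
formatted = f"{charCount:,}"  # same spec as "{:,}".format(charCount)
print(f"This string has {formatted} characters")
```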


In [4]:
poeString[:25]

"Project Gutenberg's The W"

In [5]:
print("First character:", poeString[0])
print("Last character:", poeString[-1])
print("First 25 characters:", poeString[:25])
print("Last 25 characters:", poeString[-25:])
print("Characters 8 to 30:", poeString[8:30])

First character: P
Last character: .
First 25 characters: Project Gutenberg's The W
Last 25 characters: to hear about new eBooks.
Characters 8 to 30: Gutenberg's The Works 
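Slices are half-open: `s[a:b]` starts at index `a` and stops just before index `b`, so it always contains `b - a` characters when both bounds are in range. A short sketch on a stand-in string (`sample` is illustrative, not the downloaded text):

```python
sample = "Project Gutenberg"

# s[a:b] starts at index a and stops *before* index b
firstSlice = sample[8:12]   # characters at indexes 8, 9, 10 and 11
print(firstSlice, len(firstSlice))

# negative indexes count back from the end of the string
lastFour = sample[-4:]
print(lastFour)

# out-of-range slice bounds are clipped rather than raising an error
clipped = sample[8:1000]
print(clipped)
```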


In [6]:
print("Occurrences of 'corpse':", poeString.count("corpse"))


Occurrences of 'corpse': 65


In [7]:
print("Occurrences of 'corpse':", poeString.count("corpse"))
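# NOTE: the argument below is "corpse", not "corps", so this repeats the previous count;
# poeString.count("corps") would count the shorter substring, which also occurs inside "corpse"
print("Occurrences of 'corps':", poeString.count("corpse"))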
print("Occurrences of 'Corpse':", poeString.count("Corpse"))

Occurrences of 'corpse': 65
Occurrences of 'corps': 65
Occurrences of 'Corpse': 0


In [8]:

# upper-case the whole text first so that the count ignores case
print("Occurrences of 'CORPSE':", poeString.upper().count("CORPSE"))

Occurrences of 'CORPSE': 65
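`str.count` counts substrings, so a search for "corps" also matches inside "corpse". One way to count whole words only, and to ignore case without upper-casing the text, is the standard `re` module. A minimal sketch on a made-up sentence (`sample` and the variable names are illustrative):

```python
import re

sample = "The corpse lay still. A corps of cadets passed the Corpse."

# str.count matches substrings, so "corps" is also found inside "corpse"
substring_hits = sample.count("corps")

# \b marks a word boundary, so only the standalone word "corps" matches
whole_word_hits = len(re.findall(r"\bcorps\b", sample))

# re.IGNORECASE counts "corpse" and "Corpse" alike
any_case_hits = len(re.findall(r"\bcorpse\b", sample, flags=re.IGNORECASE))

print(substring_hits, whole_word_hits, any_case_hits)
```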


In [9]:
firstCorpse = poeString.find("corpse") # the index position of the first occurrence of "corpse"
context = 30 # number of characters to show on either side of the index position
print(poeString[firstCorpse-context : firstCorpse+context])

and (horrible to relate!) the corpse of the daughter, head



In [10]:
start = poeString.find("THE GOLD-BUG")
end = poeString.find("FOUR BEASTS IN ONE")
goldBugString = poeString[start:end].strip()
# show start and end of goldBugString
print(goldBugString[:50], "[…] ", goldBugString[-50:])

THE GOLD-BUG

          What ho! what ho! this f […]  it; perhaps it
required a dozen--who shall tell?"
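One thing to watch with this approach: `find` returns -1 when a marker is missing, and slicing with -1 silently produces the wrong text instead of an error. `find` also returns the first match, which may not be the intended one if a title appears earlier as well (for example in a table of contents). A hedged sketch of a helper that fails loudly when a marker is missing; `extract_between` is a hypothetical name and the sample string is made up:

```python
def extract_between(text, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # search for the end marker only after the start marker
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end].strip()

sample = "HEADER\nTHE GOLD-BUG\nstory text\nFOUR BEASTS IN ONE\nFOOTER"
print(extract_between(sample, "THE GOLD-BUG", "FOUR BEASTS IN ONE"))
```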


In [11]:
import os
directory = "data"
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs(directory, exist_ok=True)

In [12]:
with open("data/goldBug.txt", "w") as f:
    f.write(goldBugString)

In [13]:
with open("data/goldBug.txt", "r") as f:
    goldBugString2 = f.read()

In [14]:
print(goldBugString2[:50], "[…] ", goldBugString2[-50:])

THE GOLD-BUG



          What ho! what ho! this f […]  it; perhaps it

required a dozen--who shall tell?"


In [15]:

goldBugString == goldBugString2 # are these two strings the same?

False
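The comparison is `False`, and the reloaded text shows doubled blank lines. A likely explanation, assuming the notebook ran on Windows and the downloaded text uses `\r\n` line endings: writing in text mode translates each `\n` to `\r\n`, so `\r\n` becomes `\r\r\n`, which reads back as two line breaks. A minimal sketch of one way to avoid this, passing `newline=""` so that Python does not translate line endings in either direction (the temporary directory and the short `original` string are stand-ins):

```python
import os
import tempfile

original = "THE GOLD-BUG\r\n\r\nWhat ho! what ho!"

with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, "goldBug.txt")

    # newline="" writes line endings exactly as they appear in the string
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(original)

    # newline="" on reading returns line endings untranslated
    with open(path, "r", encoding="utf-8", newline="") as f:
        reloaded = f.read()

print(original == reloaded)
```

An alternative is to normalize the text once with `replace("\r\n", "\n")` before saving and to compare against the normalized string.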

In [16]:
import glob
textFiles = glob.glob("data/*.txt")
textFiles

['data\\goldBug.txt']

In [17]:

type(textFiles)

list

In [18]:
totalCharacters = 0
for textFile in textFiles:
    with open(textFile, "r") as f:  # the file is closed automatically
        textString = f.read()
    chars = len(textString)
    print(textFile, "has", chars, "characters")
    totalCharacters += chars
print("total characters: ", totalCharacters)

data\goldBug.txt has 77916 characters
total characters:  77916


## On to the questions:
- How would you create a subdirectory called Austen under the data directory we've already created?
- For each of the plain text novels in English by Jane Austen in Project Gutenberg:
    - How would you isolate the text content (without the Project Gutenberg header and footer)?
    - How would you save the text-only content into the data/Austen directory?
- How would you loop over the files in the data/Austen directory and, for each one, print the file name and a count of "his" and "her"?
- What is the total number of characters in the Austen corpus?
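For the first and third questions, one possible approach is a small helper that creates the target directory on demand and then writes the file. This is only a sketch: `save_text` is a hypothetical name, and the demonstration uses a temporary directory so that nothing is left behind.

```python
import os
import tempfile

def save_text(text, directory, filename):
    """Write text to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)  # also creates missing parent directories
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path

# demonstrated in a temporary directory;
# in the notebook, directory would be os.path.join("data", "Austen")
with tempfile.TemporaryDirectory() as baseDir:
    austenDir = os.path.join(baseDir, "data", "Austen")
    savedPath = save_text("Chapter 1", austenDir, "pride.txt")
    savedOk = os.path.exists(savedPath)
    print(os.path.basename(savedPath), savedOk)
```

Building paths with `os.path.join` rather than typing slashes by hand keeps the code portable between Windows and other systems.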

There must be a more efficient way to grab all the Jane Austen texts at once rather than one by one, but for the sake of practice (and until I ask Stefan), I'll do two (P&P and Emma) and see if I can figure out at least some basic looping.

First, Pride and Prejudice

In [19]:
import urllib.request
austenUrl = "http://www.gutenberg.org/cache/epub/1342/pg1342.txt"
austenString = urllib.request.urlopen(austenUrl).read().decode().strip()
print("this string has", len(austenString), "characters")

this string has 717570 characters


In [59]:
start = austenString.find("Chapter 1")
end = austenString.find("End of the Project")
prideString = austenString[start:end].strip()
print(prideString[:50], "[...]", prideString[-50:])

Chapter 1


It is a truth universally acknowled [...] to Derbyshire, had been the means of uniting them.


In [60]:
with open("data/pride.txt", "w") as f:
    f.write(prideString)

In [61]:
with open("data/pride.txt", "r") as f:
    prideString2 = f.read()

In [62]:
print(prideString2[:50], "[...]", prideString2[-50:])

Chapter 1





It is a truth universally acknowled [...] to Derbyshire, had been the means of uniting them.


Okay, now for Emma

In [64]:
import urllib.request
emmaUrl = "http://www.gutenberg.org/cache/epub/158/pg158.txt"
emmaString = urllib.request.urlopen(emmaUrl).read().decode().strip()

In [67]:
start = emmaString.find("VOLUME I")
end = emmaString.find("End of the Project")
trueEmmaString = emmaString[start:end].strip()
print(trueEmmaString[:50], "[...]", trueEmmaString[-50:])

VOLUME I



CHAPTER I


Emma Woodhouse, han [...] n the perfect happiness of the union.



FINIS


In [68]:
with open("data/emma.txt", "w") as f:
    f.write(trueEmmaString)

In [69]:
with open("data/emma.txt", "r") as f:
    emmaString2 = f.read()

In [70]:
print(emmaString2[:50], "[...]", emmaString2[-50:])

VOLUME I







CHAPTER I





Emma Woodhouse, han [...] n the perfect happiness of the union.







FINIS


Okay, I have the files saved, I think. Let's just check:

In [82]:
import glob
textFiles = glob.glob("data/*.txt")
textFiles

['data\\emma.txt', 'data\\goldBug.txt', 'data\\pride.txt']

Let's just say, hypothetically, that goldBug isn't there. The next step is to somehow loop over these files (ruh-roh).

In [80]:
totalCharacters = 0
for textFile in textFiles:
    with open(textFile, "r") as f:  # the file is closed automatically
        textString = f.read()
    chars = len(textString)
    print(textFile, "has", chars, "characters")
    totalCharacters += chars
print("total characters: ", totalCharacters)

total characters:  0
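The output shows a total of 0 and no per-file lines, so the loop body never ran: `textFiles` was presumably empty when this cell executed. Its counter (80) is lower than that of the glob cell shown above it (82), which suggests the cells ran in a different order from the one displayed. A sketch that builds the file list inside the function, so it cannot be stale, and counts "his" and "her" as whole words; `summarize_corpus` is a hypothetical name, and the two tiny files stand in for the contents of data/Austen:

```python
import glob
import os
import re
import tempfile

def summarize_corpus(directory):
    """Print per-file counts of "his" and "her"; return total characters."""
    totalCharacters = 0
    # build the file list here so it cannot be stale or empty by accident
    for textFile in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        with open(textFile, "r", encoding="utf-8") as f:
            textString = f.read()
        # \b restricts matches to whole words, so "this" and "here" are skipped
        hisCount = len(re.findall(r"\bhis\b", textString, flags=re.IGNORECASE))
        herCount = len(re.findall(r"\bher\b", textString, flags=re.IGNORECASE))
        print(os.path.basename(textFile), "his:", hisCount, "her:", herCount)
        totalCharacters += len(textString)
    return totalCharacters

# demonstrated on two tiny stand-in files in a temporary directory
with tempfile.TemporaryDirectory() as demoDir:
    samples = {"a.txt": "His hat and her coat. This is here.", "b.txt": "Her book."}
    for name, text in samples.items():
        with open(os.path.join(demoDir, name), "w", encoding="utf-8") as f:
            f.write(text)
    demoTotal = summarize_corpus(demoDir)
print("total characters:", demoTotal)
```

In the notebook itself, the call would be something like `summarize_corpus(os.path.join("data", "Austen"))`, assuming the Austen files have been saved there.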
