# File streams

## Downloading the data for this notebook

The data source is Project Gutenberg (`www.gutenberg.org`), which hosts many public domain texts.

We are going to download *Pride and Prejudice*, Jane Austen's second novel,
published in 1813.

Skip this section if you already have the data (`pride_and_prejudice.txt`) downloaded.

In [11]:
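# Import the submodule explicitly; a bare "import urllib" does not guarantee
# that urllib.request is loaded.
import urllib.request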

url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
url_channel = urllib.request.urlopen(url)

bts = url_channel.read()

print(bts[:200])

b'\xef\xbb\xbfThe Project Gutenberg eBook of Pride and Prejudice\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrict'


In [10]:
type(bts)

bytes

Whoa! This is not a string! It's undecoded bytes! That is why `print` showed a `b'...'` literal full of escape sequences such as `\r\n` instead of nicely formatted text. Let's convert it to a string using the proper encoding (UTF-8, which is advertised on `gutenberg.org`, though it's hardly surprising).

In [14]:
print(bts.decode("UTF8")[:200])

﻿The Project Gutenberg eBook of Pride and Prejudice
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictio
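
The decoded text starts with an invisible byte-order mark (`\ufeff`), which comes from the `\xef\xbb\xbf` prefix visible in the raw bytes. Here is a minimal sketch, using a short made-up byte string rather than the downloaded file, of how Python's `utf-8-sig` codec differs from plain `utf-8` in this respect. The names `sample_bytes`, `plain`, and `stripped` are only for illustration.

```python
# A short stand-in for the start of the downloaded bytes (not the real file).
sample_bytes = b"\xef\xbb\xbfThe Project Gutenberg eBook"

plain = sample_bytes.decode("utf-8")         # keeps the mark as '\ufeff'
stripped = sample_bytes.decode("utf-8-sig")  # drops a leading mark if present

print(repr(plain[:4]))
print(repr(stripped[:4]))
print(type(sample_bytes), type(plain))
```

The rest of this notebook keeps the plain `UTF8` decoding, so the mark shows up again when the first line of the file is read.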


I believe Google Colab will allow you to do the following, so that we can practice working with files in the rest of this notebook:

In [15]:
#  Create the data file in the local notebook directory.
with open("pride_and_prejudice.txt","w") as ofh:
    ofh.write(bts.decode("UTF8"))

##  Loading the data file into the notebook

For the next few lines of code to work, you need a copy of *Pride and Prejudice* (from Project Gutenberg, for example, as done at the beginning of this notebook) in the directory where you keep your notebook. If the file is somewhere else, edit line 5 of the code below to reflect its actual location.

In [16]:
from collections import Counter
ctr = Counter()
token_ctr = 0

with open('pride_and_prejudice.txt', 'r') as file_handle:
    for line in file_handle:
        line_words = line.strip().split()
        for word in line_words:
            token_ctr += 1
            ctr[word] += 1

In [17]:
token_ctr

130410

In [18]:
with open('pride_and_prejudice.txt') as fh:
    file_str = fh.read()

In [20]:
type(file_str)

str

In [21]:
print(file_str[1000:1500])

           PUBLISHER

                        156 CHARING CROSS ROAD
                                LONDON

                             RUSKIN HOUSE
                                   ]

                            [Illustration:

               _Reading Jane’s Letters._      _Chap 34._
                                   ]




                                PRIDE.
                                  and
                               PREJUDICE

                                  by
             


## The data

In [22]:
len(file_str)
# print(file_str[:1000])

748126

In [23]:
len(ctr)

14703

In [24]:
token_ctr

130410

The numbers you get may differ slightly from those I get because I am using a slightly modified
version of the file, but your numbers should be fairly close.

In [25]:
ctr.most_common(10)

[('the', 4509),
 ('to', 4275),
 ('of', 3899),
 ('and', 3443),
 ('a', 2021),
 ('in', 1923),
 ('her', 1905),
 ('was', 1817),
 ('I', 1764),
 ('that', 1458)]

In [26]:
ctr['Elizabeth']

395

In [27]:
ctr['Lizzy']

13

In [28]:
ctr['Darcy']

217

This is an interesting statistic. Darcy's mention count is a little over half of Elizabeth's.
Those who know the book will remember that Elizabeth is in many more scenes
than Darcy. So either Darcy is mentioned more often in the scenes where he is present, or he is mentioned in a lot of scenes in which he isn't present. It would be interesting to know what percentage of "Darcy" mentions
are made in scenes in which he is absent. And of course it would also be interesting to know the number of scenes
in which Elizabeth is mentioned but absent, which is no doubt much lower. But if you think about it, this is
going to be a hard number to come up with automatically. In fact, just segmenting the text into scenes is no trivial
task for a computer. This is just a very simple example of the kind of information that is difficult to
extract from text.
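
A much simpler limitation applies to the counts themselves: splitting on whitespace leaves punctuation attached, so `Elizabeth,` and `Elizabeth` are different keys in `ctr`. The following is a minimal sketch on a made-up sentence (not a quotation from the novel) that compares raw whitespace tokens with tokens whose leading and trailing ASCII punctuation has been stripped. Using `string.punctuation` is one assumed way to clean tokens. It does not cover the typographic quotes that appear in the real text.

```python
from collections import Counter
import string

# A made-up sentence, not a quotation from the novel.
text = "Elizabeth smiled. Darcy looked at Elizabeth, and Elizabeth's sister laughed."

# Whitespace tokens, exactly as in the counting loop earlier.
raw_counts = Counter(text.split())

# Strip leading/trailing ASCII punctuation from each token before counting.
clean_counts = Counter(tok.strip(string.punctuation) for tok in text.split())

print(raw_counts["Elizabeth"])    # counts only the bare token
print(clean_counts["Elizabeth"])  # also counts "Elizabeth,"
```

The possessive `Elizabeth's` stays a separate key in both counters, because the apostrophe sits inside the token rather than at its edge.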

## Looking at a file stream: an iterator

In [36]:
fh = open('pride_and_prejudice.txt','r')

The file stream `fh` is what is called an **iterator**.

In [37]:
fh

<_io.TextIOWrapper name='pride_and_prejudice.txt' mode='r' encoding='UTF-8'>

An iterator has state: it keeps track of a **next element**.
The elements of a text file iterator are **lines** in the file.

In [38]:
next(fh)

'\ufeffThe Project Gutenberg eBook of Pride and Prejudice\n'

In [39]:
next(fh)

'    \n'

In [40]:
next(fh)

'This ebook is for the use of anyone anywhere in the United States and\n'

Every iterator is also an iterable, meaning you can loop through it:

In [41]:
for line in fh:
    print(line)
    break

most other parts of the world at no cost and with almost no restrictions
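
The same behavior can be reproduced without the novel on disk. This is a small sketch in which `io.StringIO` stands in for an open text file. The three lines and the name `fh_demo` are made up for illustration. It shows that calling `iter()` on the stream returns the stream itself, and that a `for` loop resumes from wherever `next` left off.

```python
import io

# io.StringIO stands in for an open text file; the three lines are made up.
fh_demo = io.StringIO("first line\nsecond line\nthird line\n")

# An iterator returns itself from iter(), which is what makes it an iterable.
print(iter(fh_demo) is fh_demo)

first = next(fh_demo)  # consumes "first line\n"

# The loop does not restart: it picks up after the line already consumed.
remaining = [line for line in fh_demo]
print(repr(first), remaining)
```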



Strings are also iterables. The lines returned when we loop through the file
are strings and therefore iterables. But strings are not iterators. They have no
`__next__` method, and therefore the `next` function raises a `TypeError`:

In [34]:
for line in fh:
    next(line)

TypeError: 'str' object is not an iterator
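
To step through a string one element at a time, ask for an iterator over it with the built-in `iter` function. Here is a short sketch, with `word` and `chars` as illustrative names:

```python
word = "Darcy"

try:
    next(word)  # a str is an iterable but not an iterator
except TypeError as err:
    print("TypeError:", err)

chars = iter(word)  # iter() builds a new iterator over the string
print(next(chars), next(chars))

# Each call to iter() starts again from the beginning of the string.
print(next(iter(word)))
```

Unlike the file stream, the string itself holds no position. Only the iterator objects made from it do.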

Let's loop through the rest of the file without doing anything to the lines:

In [42]:
for line in fh:
    line

Iterators can be exhausted. We've now consumed every line in the file, so
there is no next element, and `next` raises `StopIteration`:

In [43]:
next(fh)

StopIteration: 
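
Two standard ways of dealing with an exhausted stream are sketched below, again with `io.StringIO` standing in for an open file (the contents and the name `fh_demo` are made up). Passing a default as the second argument to `next` avoids the exception, and `seek(0)` moves the stream position back to the start so the lines can be read again.

```python
import io

# io.StringIO stands in for an open text file; the contents are made up.
fh_demo = io.StringIO("line one\nline two\n")

for _ in fh_demo:  # run the iterator to the end
    pass

# With a default value, next() returns it instead of raising StopIteration.
print(next(fh_demo, None))

fh_demo.seek(0)  # move the stream position back to the beginning
print(repr(next(fh_demo)))
```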

## Writing information to files

In [None]:
word_count = dict(walk=3,talk=1,the=422)

In [6]:
word_count

{'talk': 1, 'the': 422, 'walk': 3}

In [15]:
# The with statement closes the file automatically, even if an error occurs.
with open('counts.txt', 'w') as ofh:
    for (word, count) in word_count.items():
        print(word, count, file=ofh)
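
To check that the file holds what we expect, we can read it back and rebuild the dictionary. The sketch below is one way to do that. It writes to a temporary directory so nothing is left behind, and the names `tmp_dir`, `path`, and `restored` are illustrative. It assumes each line has the form `word count` with no spaces inside the word.

```python
import os
import tempfile

word_count = dict(walk=3, talk=1, the=422)

# Write to a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "counts.txt")

    with open(path, "w", encoding="utf-8") as ofh:
        for word, count in word_count.items():
            print(word, count, file=ofh)

    # Read the file back: each line is "word count".
    restored = {}
    with open(path, encoding="utf-8") as ifh:
        for line in ifh:
            word, count = line.split()
            restored[word] = int(count)  # counts come back as text

print(restored == word_count)
```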