# File input and output


## File input

The `open()` function opens a file and returns a file object (also called a handle). You assign this object to a variable and use it whenever you need to call file-related functions or methods. For example, to read a line from a file, you open the file, assign the handle to a variable, and then call the `readline()` method on that handle.

`f.read(size)` reads up to `size` characters and returns them as a string; `size` is an optional numeric argument. When `size` is omitted or negative, the entire contents of the file are read and returned.
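
As a minimal sketch of how the `size` argument behaves, the example below writes a small throwaway file and reads it back in pieces. The temporary path and the file contents are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
out = open(path, "w")
out.write("first line\nsecond line\n")
out.close()

f = open(path)
head = f.read(5)   # at most 5 characters
rest = f.read()    # size omitted: everything that is left
end = f.read()     # nothing left to read: an empty string
f.close()

print(repr(head))
print(repr(rest))
print(repr(end))
```

Each call continues from where the previous one stopped, so an empty string is the usual signal that the end of the file has been reached.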

In [1]:
f=open("data/jane-austen-emma.txt")
# f.read() would return the whole book as one string
f.readline()

'The Project Gutenberg EBook of Emma, by Jane Austen\n'

As noted above, the `read()` method retrieves the contents of the whole file at once, which can be problematic for large files. The `readline()` method retrieves a single line instead. Here it returned the first line of the file.

In [2]:
f.readline()

'\n'

Did you notice that the second call to `readline()` returned the second line of the file? The file object keeps track of its position in the file, and each `readline()` call advances that position to the start of the next line.
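
The position tracking can be seen on a tiny file. This is only a sketch: the two-line file and its temporary path are made up for the illustration.

```python
import os
import tempfile

# Hypothetical two-line file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")
out = open(path, "w")
out.write("alpha\nbeta\n")
out.close()

f = open(path)
first = f.readline()    # the first line
second = f.readline()   # the position has moved on, so this is the second line
third = f.readline()    # past the last line: an empty string
f.close()

print(repr(first), repr(second), repr(third))
```

A blank line inside the file comes back as `'\n'`, whereas the end of the file comes back as `''`, so the two cases can be told apart.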

One way to access all the content of a file is to loop over its lines. This is the most memory-efficient approach because only a single line is retrieved and processed at each iteration. In the example below, a large file is read line by line; instead of printing the contents (and filling up the screen), we count the number of lines.

In [3]:
# first approach
f=open("data/jane-austen-emma.txt")
count=0
for line in f:
    # print(line)
    count += 1
print(count)
f.close()

16633


The second approach is to use the `readlines()` method, which reads every line of the file into a list. The advantage is that you can access any line by index, even after the file is closed. The disadvantage is that for a big file the list may take up a lot of memory.

In [4]:
# second approach
f=open("data/jane-austen-emma.txt")
lines = f.readlines()
f.close()

When you're done with a file, call `f.close()` to close it and free up any system resources taken up by the open file.
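
A common alternative to calling `f.close()` by hand is the `with` statement, which closes the file automatically when the block ends, even if an error occurs inside it. The sketch below uses a hypothetical temporary file so that it runs on its own.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write("one\ntwo\nthree\n")

with open(path) as f:
    sample_lines = f.readlines()

# The file was closed when the with-block ended; the list is still usable.
print(f.closed)
print(sample_lines)
```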

As mentioned above, with `readlines()` you can keep the contents of a file in a list which is accessible via index (or slice) even after the file is closed.

In [5]:
lines[0:9]

['The Project Gutenberg EBook of Emma, by Jane Austen\n',
 '\n',
 'This eBook is for the use of anyone anywhere at no cost and with\n',
 'almost no restrictions whatsoever.  You may copy it, give it away or\n',
 're-use it under the terms of the Project Gutenberg License included\n',
 'with this eBook or online at www.gutenberg.org\n',
 '\n',
 '\n',
 'Title: Emma\n']

Let's do a word count using the list of lines. To avoid counting `The` and `the` as different words, we convert each word to lowercase first.

In [6]:
freq = {}
for line in lines: 
    for word in line.split():
        word = word.lower()
        freq[word] = freq.get(word,0)+1

Let's view the first 9 words and their counts (the slice `[0:9]` selects 9 items):

In [7]:
words = list(freq.keys())[0:9]
for w in words:
    print("%s: %d" % (w,freq[w]))

the: 5269
project: 84
gutenberg: 24
ebook: 9
of: 4312
emma,: 167
by: 574
jane: 200
austen: 4
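
Because `split()` separates words only on whitespace, punctuation stays attached, which is why `emma,` appears with its comma. One possible refinement, sketched here on two made-up lines rather than on the novel, strips punctuation from both ends of each word and uses `collections.Counter` for the counting.

```python
import string
from collections import Counter

# Hypothetical sample lines standing in for the contents of the book.
sample_lines = [
    "Emma, by Jane Austen\n",
    "Emma Woodhouse, handsome, clever, and rich\n",
]

counts = Counter()
for line in sample_lines:
    for word in line.split():
        # Lowercase, then drop punctuation at both ends of the word.
        cleaned = word.lower().strip(string.punctuation)
        if cleaned:
            counts[cleaned] += 1

print(counts["emma"])
print(counts.most_common(1))
```

This only trims punctuation at the edges of a word; characters inside a word, such as the hyphen in `re-use`, are left alone.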


### `line` versus `lines`

There's a big difference between
* lists
* generators/iterators

A list gives direct access to its elements by index. However, this comes at a cost: the whole list takes up space in memory.

With generators and iterators there is usually no direct access; the elements are produced on demand, so no memory is needed to store all of them at once.

Let's see the size of the `lines` list in memory. Note that `sys.getsizeof()` reports only the list object itself (its references to the elements), not the strings it refers to.


In [8]:
import sys
sys.getsizeof(lines)

133344
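
For comparison, here is a small sketch that measures a list and a generator producing the same values. The variable names and the choice of squares are assumptions for the illustration, and the exact sizes depend on the Python version and platform.

```python
import sys

# The same 10,000 values, stored in a list versus produced on demand.
numbers_list = [n * n for n in range(10_000)]
numbers_gen = (n * n for n in range(10_000))

list_size = sys.getsizeof(numbers_list)  # grows with the number of elements
gen_size = sys.getsizeof(numbers_gen)    # small and independent of the length
print(list_size > gen_size)

total = sum(numbers_gen)      # consumes the generator
leftover = list(numbers_gen)  # nothing is left on a second pass
print(total, leftover)
```

The trade-off is that a generator can be traversed only once and does not support indexing.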

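We can do the same count while holding only one line of the file in memory at a time, by iterating over the file object, which yields each `line` on demand (the dictionary of counts still takes up memory, of course):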

In [9]:

f=open("data/jane-austen-emma.txt")
another_dict = {}
for line in f: 
    for word in line.split():
        another_dict[word.lower()] = another_dict.get(word.lower(),0)+1
f.close()

list(another_dict.items())[0:9]


[('the', 5269),
 ('project', 84),
 ('gutenberg', 24),
 ('ebook', 9),
 ('of', 4312),
 ('emma,', 167),
 ('by', 574),
 ('jane', 200),
 ('austen', 4)]

How many lines, words and unique words are there?

In [10]:
uniq_words = len(freq.keys())
total_words = sum([len(line.split()) for line in lines])


In [11]:
template= "In the novel, there are %d lines \
and total of %d words are used. Number of \
unique words is %d"

print(template % (len(lines),total_words,uniq_words))

In the novel, there are 16633 lines and total of 160458 words are used. Number of unique words is 17460


## Output to files

Open a file for writing.

> Be aware, `w` mode will overwrite an existing file!

In [12]:
f = open('data/test.txt', 'w')

A file named `test.txt` has been opened in the `data` folder, and `f` is the file object. There are various ways to write to the file.

In [13]:
f.write("Hello world")

11

Let's check and see the contents of the file.

Why is it empty?

In [14]:
f.close()

> We discussed this last week. Output is buffered, so the contents are not written to the file on disk immediately; closing the file (or calling `f.flush()`) writes them out.
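
The delay can be observed by checking the file size on disk before and after the buffer is written out. This is a sketch on a hypothetical temporary file; the size before flushing is typically 0 for such a short write, but that depends on the buffering in effect.

```python
import os
import tempfile

# Hypothetical file used only to observe buffering.
path = os.path.join(tempfile.mkdtemp(), "buffered.txt")

f = open(path, "w")
f.write("Hello world")
size_before = os.path.getsize(path)  # typically 0: the text is still buffered

f.flush()                            # push the buffered text to the file
size_after = os.path.getsize(path)   # now all 11 characters are on disk
f.close()

print(size_before, size_after)
```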

You can also write to a file with the `print` function by passing the `file=` argument.

In [15]:
f = open('data/test2.txt', 'w')
print("Second hello..", file=f)
print("to the screen")
print("to the file", file=f)
f.close()

to the screen


Now, a more serious example:

In [16]:
# source: https://scipython.com/book/chapter-2-the-core-python-language-i/examples/writing-numbers-to-a-file/
f = open('data/powers.txt', 'w')
for i in range(1,1001):
    print(i, i**2, i**3, i**4, sep=', ', file=f)
f.close()

Modes for reading/writing files:

* **w** : Write mode. If the file does not exist, it creates a new file. *But* if the file exists, it truncates the file.
* **a** : Append mode. Adds lines to the end of the file (if the file does not exist, it creates a new file).
* **x** : Creates a new file. If the file already exists, the operation fails.
* **r** : Read mode (the default).
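
The modes listed above can be tried out on a throwaway file. This is a minimal sketch; the temporary path and the file name are assumptions for the illustration.

```python
import os
import tempfile

# Hypothetical file used only to try out the modes.
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

f = open(path, "w")       # creates the file (or truncates an existing one)
f.write("first\n")
f.close()

f = open(path, "a")       # appends to the end, keeping what is there
f.write("second\n")
f.close()

f = open(path, "r")       # read mode
content = f.read()
f.close()

try:
    open(path, "x")       # the file already exists, so this fails
    created = True
except FileExistsError:
    created = False

print(repr(content), created)
```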

Let's read the cubes back from the file. We'll collect the values from the 3rd column.

In [17]:
f = open('data/powers.txt', 'r')
cubes= []
for line in f:
    fields = line.split(',')
    cubes.append(int(fields[2]))
f.close()
n = 5
print(n, 'cubed is', cubes[n-1])

5 cubed is 125


In [18]:
len(cubes)

1000

In [19]:
cubes[1:5]

[8, 27, 64, 125]
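
For comma-separated files, the standard library's `csv` module is an alternative to splitting each line by hand. The sketch below writes a short file in the same format as `powers.txt` (only 5 rows, at a hypothetical temporary path) and reads the cubes back; `skipinitialspace=True` drops the space that follows each comma.

```python
import csv
import os
import tempfile

# Hypothetical 5-row file in the same "i, i**2, i**3, i**4" format.
path = os.path.join(tempfile.mkdtemp(), "powers_sample.txt")
with open(path, "w") as out:
    for i in range(1, 6):
        print(i, i**2, i**3, i**4, sep=", ", file=out)

sample_cubes = []
with open(path, newline="") as f:
    for row in csv.reader(f, skipinitialspace=True):
        sample_cubes.append(int(row[2]))  # 3rd column holds the cubes

print(sample_cubes)
```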