
# File handling

## Text files and binary files

A **text file** is a computer file that contains only plain text, that is, a sequence of binary representations of characters in some **encoding**. The term is often used in opposition to **binary file**, which may contain any kind of binary sequence. Images, audio, executables, etc. are all binary files in this respect. Microsoft Office / LibreOffice documents (e.g. Word) and PDF files are also **not** text files, even though they contain text. That's because they also store other information about formatting, page structure and so on. Conversely, a plain text file cannot use special formatting, such as bold or italic, nor can it host tables or images.

Conventionally, text files have the *.txt* extension, though that is by no means a rule (nor a guarantee). Source code files, like Python scripts with the *.py* extension or C code files with the *.c* extension, are also text files. Another common extension is *.csv* (Comma-Separated Values).

Mind that, even when a *.txt* or *.csv* file contains numbers, their internal representation is very different from what it would be in a binary file, as each digit of the number is represented as a distinct text character, with the corresponding bit sequence. For example, in a text file with ASCII encoding, the number 17 will be represented as the encoding for the character '1' (00110001), followed by that for the character '7' (00110111). This is very different from the binary representation of the integer number 17 (00010001). Of course, the same difference exists between the string '17' and the integer number 17: they are very different objects in memory!

Python supports operations on both text and binary files. However, we will mostly talk about text files here.

**Note**: There are many possible encodings for text files (ASCII, UTF-8, ANSI, UTF-16, etc.) and different operating systems use different default encodings. On top of that, there is also the issue of **endianness**, that is, the order of the bytes inside a multi-byte word. Again, different devices (hard disk, network, etc.) may use different endianness. Discussing this kind of craziness would take hours, but fortunately you do not need to care about it most of the time, unless you are working with files produced by different systems. In that case, be ready to bite the bullet.
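To get a feeling for both issues, here is a small sketch that does not touch any file. The sample word and the number 258 are arbitrary choices made for this illustration.

```python
# Same text, two encodings: the byte sequences differ
word = 'caff\u00e8'  # 'caffè', with an accented final letter
utf8_bytes = word.encode('utf-8')
latin1_bytes = word.encode('latin-1')
print(len(utf8_bytes), len(latin1_bytes))  # 6 5

# Decoding with the wrong encoding gives garbage instead of an error
garbled = utf8_bytes.decode('latin-1')
print(garbled)

# Endianness: the same integer, with the two bytes in opposite order
big = (258).to_bytes(2, 'big')
little = (258).to_bytes(2, 'little')
print(big, little)  # b'\x01\x02' b'\x02\x01'
```

When a file comes from another system, passing the `encoding` argument to *open()* explicitly avoids relying on the default of the platform.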


## Opening and closing files

Files can be opened in Python with the *open()* function, which takes as input the file path and a *mode* specifier, which is one of the following:

| Mode | Meaning |
|-----|-----|
| "r" | open for reading (default), raises an error if the file does not exist|
| "a" | open for writing, appending to the end of the file if it exists, creating it if not|
| "w" | open for writing, creates the file if it does not exist, overwrites its content if it does|
| "x" | open for exclusive creation, raises an error if the file exists|
| "+" | open for updating (reading and writing), combined with one of the above, e.g. "r+"|
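The behaviour of the different modes can be tried out safely in a temporary directory. This is only a sketch: the file names `modes_demo.txt` and `missing.txt` and all variable names are made up for the example.

```python
import os
import shutil
import tempfile

# Throwaway directory and a hypothetical file name, used only for this demo
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')

# 'w' creates the file (an existing file would be emptied first)
demo_file = open(path, 'w')
demo_file.write('first\n')
demo_file.close()

# 'a' keeps the existing content and adds to the end
demo_file = open(path, 'a')
demo_file.write('second\n')
demo_file.close()

# 'x' fails because the file already exists
try:
    demo_file = open(path, 'x')
    demo_file.close()
    x_outcome = 'created'
except FileExistsError:
    x_outcome = 'FileExistsError'

# 'r' reads back what the previous steps left in the file
demo_file = open(path, 'r')
content = demo_file.read()
demo_file.close()

# 'r' on a missing file fails as well
try:
    open(os.path.join(tmp_dir, 'missing.txt'), 'r')
    r_outcome = 'opened'
except FileNotFoundError:
    r_outcome = 'FileNotFoundError'

shutil.rmtree(tmp_dir)  # clean up the temporary directory
print(repr(content))
print(x_outcome, r_outcome)
```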


In addition, you can specify whether the file should be handled in text or binary mode:

| Mode | Meaning |
|-----|-----|
| "t" | text mode (default)|
| "b" | binary mode|

*open()* returns a handle to the file, which you can use to operate on it.
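As a quick illustration of the text/binary distinction, the sketch below writes a short file and reads it back in both modes. The file name `binary_demo.txt` and the variable names are hypothetical, and the file is placed in a temporary directory.

```python
import os
import shutil
import tempfile

# Throwaway directory and a hypothetical file name, used only for this demo
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'binary_demo.txt')

# Write two characters in text mode
out_file = open(path, 'wt', encoding='ascii')
out_file.write('17')
out_file.close()

# Text mode: read() returns a str
text_handle = open(path, 'rt', encoding='ascii')
as_text = text_handle.read()
text_handle.close()

# Binary mode: read() returns the raw bytes
binary_handle = open(path, 'rb')
as_bytes = binary_handle.read()
binary_handle.close()

shutil.rmtree(tmp_dir)  # clean up the temporary directory
print(type(as_text), repr(as_text))
print(type(as_bytes), repr(as_bytes))
```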

In [34]:
# Open a text file in 'write' mode (creates it the first time)
# Note: if the file exists it will be overwritten
new_file = open('my_file.txt', 'wt')
# Write some text to it
new_file.write('Welcome to the Intensive School for advanced Graduate Studies - Machine Learning\n')
new_file.write('This is a sample line\n')
new_file.write('This is another sample line\n')
# Close it
new_file.close()

In [33]:
# Reopen it in 'read' mode
prev_file = open('my_file.txt', 'r')
# In text mode, read(n) reads up to n characters
print(prev_file.read(7))
print(prev_file.read(7))
# See where we are in the file (byte position)
print(prev_file.tell())
# Go to a specific position (in this case right after the first character)
prev_file.seek(1)
# Read all characters up to and including the newline
print(prev_file.readline())
# Read all the remaining lines: return a list of strings, one per line
print(prev_file.readlines())
# Close the file
prev_file.close()

Welcome
 to the
14
elcome to the Intensive School for advanced Graduate Studies - Machine Learning

['This is a sample line\n', 'This is another sample line\n']


Files in Python support line-by-line iteration in a for loop:

In [37]:
prev_file = open('my_file.txt')
for line in prev_file:
    print(line)
prev_file.close()

Welcome to the Intensive School for advanced Graduate Studies - Machine Learning

This is a sample line

This is another sample line
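The blank lines in the output appear because each line read from the file still ends with its own newline character, and *print()* adds another one. A possible way around it, sketched here with a hypothetical file `lines_demo.txt` in a temporary directory, is to strip the trailing newline from each line.

```python
import os
import shutil
import tempfile

# Throwaway directory and a hypothetical file name, used only for this demo
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'lines_demo.txt')

out_file = open(path, 'w')
out_file.write('first line\nsecond line\n')
out_file.close()

stripped_lines = []
in_file = open(path)
for line in in_file:
    # Each line still carries its trailing '\n': remove it
    stripped_lines.append(line.rstrip('\n'))
in_file.close()

shutil.rmtree(tmp_dir)  # clean up the temporary directory
print(stripped_lines)  # ['first line', 'second line']
```

Calling `print(line, end='')` inside the loop is an alternative when the lines only need to be displayed.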



## Context managers

Having to remember to close the file all the time is annoying. Also, what happens if an error occurs while we are operating on the file? The close line will never get executed! Although Python will eventually release the resources for us in any case, there is a much safer way to control exactly when a file is closed: the **with** statement. An object that, like a file, can be used in a **with** statement to automatically execute some operations when entering and exiting a code block is called a **context manager**.

In [None]:
""" The file is automatically closed at the end of the indented block,
no matter what."""
with open('my_file.txt', 'r') as my_file:
    print(my_file.readline())

# No need to close the file!
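To check that the file really gets closed even when something goes wrong, one can raise an error on purpose inside the block. This is a minimal sketch: the file name `with_demo.txt`, the variable names and the `ValueError` are all chosen just for the demonstration.

```python
import os
import shutil
import tempfile

# Throwaway directory and a hypothetical file name, used only for this demo
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'with_demo.txt')

try:
    with open(path, 'w') as demo_file:
        demo_file.write('some text\n')
        # Simulate a failure in the middle of the block
        raise ValueError('something went wrong')
except ValueError:
    pass

# The with statement closed the file before the error propagated
closed_after_error = demo_file.closed
print(closed_after_error)  # True

shutil.rmtree(tmp_dir)  # clean up the temporary directory
```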

One-line summary: you should always use **with** when working with a file.