# Input and Output

The Python programs you have written so far used small amounts of fake data written directly into the program itself.

To deal with real data, your Python program must be able to interact with other files.
Your input data will come in different types of text files, and your Python program has to read this data in order to process it.
Similarly, the processed output should not simply be shown on the screen using the `print()` **function**; it should be saved to a new file.

Python provides built-in **functions** for doing all of this.

### Escape characters

**Escape characters** are special symbols used by Python to represent more complex text information than just characters.

Normally, every line of code in Python contains one instruction to be executed. If you simply break an instruction across two lines, it will result in an error.

In [None]:
x = 1 +
2

This may be a problem if you want to write a **string** of text that contains multiple lines.

In [None]:
x = "hello
world"

Python provides the special character `"\n"` to represent a "new line" in your **strings**. This is called an **escape character**: it's a backslash `\` followed by a regular character.

**Escape characters** are recognized by Python as special and treated accordingly.
Note how the `"\n"` is exactly replaced by a new line when you print it.

In [None]:
print("hello\nworld")
print("escaping \n characters")

New lines are not the only **escape characters** in Python.
There are many more and they are used to easily represent complex text.

Each **escape character** is replaced on its own, independently of the adjacent characters. You can also place one **escape character** right after another.

In [None]:
x = "This is a \ttab"
print(x)

x = "Mixing t and tabs t\t\tt\t"
print(x)

x = "These are quotes \' \""
print(x)

x = "This is an\n\tindented text on a new line"
print(x)

x = "And finally backslashes \\\\"
print(x)
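One way to convince yourself that an escape character really is a single character is to measure the string. The short sketch below uses the built-in functions `len()` and `repr()`; the variable name `text` is just an illustration.

```python
# An escape character takes two keystrokes to type,
# but Python stores it as one single character.
text = "a\nb"

print(len(text))   # counts 'a', the new line, and 'b'
print(repr(text))  # repr() shows the escape character instead of applying it
print(text)        # print() applies it, so the output spans two lines
```

Using `repr()` is a handy trick whenever you suspect that a string contains invisible characters such as tabs or new lines.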

### Reading from a file

The simplest thing that you can do with a file is to read its content and use it in your program.

In this directory there is a file named `dna.txt` that contains several DNA sequences.
Let's open it and print its content.

In [None]:
file_path = "dna.txt"
with open(file_path, "r") as f:
    x = f.read()
    print(x)

First of all you have to specify the file path, i.e. where Python can find your file of interest.
In this case, specifying its name is enough as it is in the same directory as this file that you are executing.

Next comes a Python statement that you have not seen so far, but that is very helpful when working with files.
The `with` statement ends with a colon `:` and its body is indented.
The reason why we use this statement is that, after a file is opened, it must also be closed, otherwise problems may occur (for example, text that you wrote may not actually reach the file).
Unfortunately, closing a file is often forgotten.
The `with` statement will automatically close the file at the end of its body.

This means that the block above is essentially equivalent to the following:

    f = open(file_path, "r")
    x = f.read()
    print(x)
    f.close()
    
As you can see, `open(...) as f` is equivalent to assigning the opened file to the variable `f`.
Another advantage of the `with` statement is that it lets you immediately see where a file is closed, i.e. at the end of the indented block.
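You can check the automatic closing yourself. The sketch below is only an illustration: it first creates a small throwaway file (using write mode, which is covered later in this chapter) so that it does not depend on `dna.txt`, and then inspects the `closed` attribute of the file object inside and after the `with` body. The names `demo_path`, `closed_inside` and `closed_after` are made up for this example.

```python
import os
import tempfile

# Create a small throwaway file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as f:
    f.write("ACGT\n")

with open(demo_path, "r") as f:
    content = f.read()
    closed_inside = f.closed  # the file is still open inside the body

closed_after = f.closed       # the file was closed when the body ended

print(closed_inside)
print(closed_after)
```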

`f` is a **file object**, and the `open()` function creates it.
Here the `open()` function takes two input arguments: the path to a file and a string defining its "mode". In this case, the `"r"` mode means that you only want to read the file.

Within the statement body, you can use the **method** `read()` provided by the file object.
This **method** reads the entire file and returns it as a single **string**, which you can store in a variable.

Files are usually made of multiple lines and you may want to keep them separate. The **method** `readlines()`, also provided by the file object, returns a **list** where each element is a **string** corresponding to a particular line of the file.

In [None]:
file_path = "dna.txt"
with open(file_path, "r") as f:
    x = f.readlines()
    print(x)
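To see the difference between the two methods side by side, here is a minimal sketch. It assumes a made-up two-line file created on the fly (so it runs even without `dna.txt`); the names `demo_path`, `whole_text` and `lines` are illustrative.

```python
import os
import tempfile

# Hypothetical two-line file standing in for dna.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_dna.txt")
with open(demo_path, "w") as f:
    f.write("ACGT\nTTGA\n")

with open(demo_path, "r") as f:
    whole_text = f.read()     # one string containing the whole file

with open(demo_path, "r") as f:
    lines = f.readlines()     # a list with one string per line

print(type(whole_text), repr(whole_text))
print(type(lines), lines)
```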

### File paths

In order to use a file in your program you have to specify the so-called file path.
The file path is a string that tells Python which file you want to work with.
In the simplest case, if the file is in the same directory as your Python program, it's sufficient to provide the name of the file as file path.

If the file is not in the same directory as your Python program, things are a little more complicated.
There are two alternative ways of specifying the path:
 - The **absolute path** is a path that starts from the root directory of your computer and traverses a bunch of directories until it reaches the desired file. It will be something like `/home/username/Downloads/filename.txt`. You can find the absolute path of a file by opening a terminal in the directory where the file is and executing the command `realpath FILENAME`, or with a right click on the file and checking its information. An absolute path in Linux always starts with the slash `/` symbol.
 - The **relative path** is a path that starts from the directory where your Python program is. For example if you have a directory named `files` that contains your file named `filename.txt`, the relative path will be `files/filename.txt`. Note how there is no slash `/` symbol at the beginning. 
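The two kinds of path can also be explored from Python itself. The sketch below uses functions from the standard `os` module; the path `files/filename.txt` is the hypothetical example from the list above and does not need to exist. Strictly speaking, Python resolves relative paths against the current working directory, which for a notebook is normally the directory the notebook is in.

```python
import os

relative_path = os.path.join("files", "filename.txt")  # hypothetical file
absolute_path = os.path.abspath(relative_path)

print(os.getcwd())                   # the directory relative paths start from
print(relative_path)
print(absolute_path)
print(os.path.isabs(relative_path))  # is this an absolute path?
print(os.path.isabs(absolute_path))
```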

### The `strip()` method

When executing the blocks of code above, you may have noticed two things: the `read()` **method** returned a multiline **string**, whereas the `readlines()` **method** returned a **list** of lines, each ending with a `\n` symbol (except possibly the last one, if the file does not end with a new line).

This symbol denotes a new line, and since you have already split the file into separate lines, most of the time it is just noise.

The **class str** provides a convenient **method** for quickly eliminating these characters: `strip()`. This **method** removes spaces, tabs, new lines and other whitespace characters from the beginning and the end of a string.

In [None]:
def show_stripped(text):
    print("The text was: '" + text + "'")
    # strip the function argument, not some unrelated variable
    stripped_text = text.strip()
    print("The stripped text is: '" + stripped_text + "'")

show_stripped("hello world ")
print("----")
show_stripped("     hello world        ")
print("----")
show_stripped("\n hello world \n\n\n")
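Two details of `strip()` are easy to miss: it only touches the two ends of the string, and it returns a new string instead of changing the original one. A minimal sketch, with made-up variable names:

```python
line = "  ACGT TTGA \n"
stripped = line.strip()

print(repr(line))      # the original string is unchanged
print(repr(stripped))  # the space in the middle is kept
```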

If you try to open in read mode a file that does not exist, you will get an error.

In [None]:
file_path = "my_file.txt"
with open(file_path, "r") as f:
    print("file opened")
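The error raised in this situation is called `FileNotFoundError`. If you have already met the `try`/`except` statement, one possible way to react to it without stopping the program is sketched below; if you have not, consider this a preview. The file name and the variable `file_found` are hypothetical, and the file is assumed not to exist.

```python
file_path = "surely_missing_file_for_demo.txt"  # assumed not to exist

try:
    with open(file_path, "r") as f:
        content = f.read()
    file_found = True
except FileNotFoundError as error:
    # This branch runs only when the file cannot be found.
    file_found = False
    print("Could not open the file:", error)
```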

### Exercise

Define a function that takes as input a file path and returns a list of stripped lines.
Test it on the `dna.txt` file.

Hint: strings are immutable. That is why the `strip()` method does not modify the object it is called on, but rather returns a new, stripped string.

### Writing to a file

The syntax for writing to a file is very similar to the one for reading.

When reading, you opened a file in mode `"r"`.
There are two different modes that allow you to write to a file:
 - write mode `"w"` first erases the content of the file
 - append mode `"a"` lets you keep writing at the end of a file while preserving its original content
 
Note that when you try to open a file in mode `"w"` or `"a"`, if the file does not exist it will be automatically created.

In [None]:
file_path = "newfile"

print("**** Opening in write mode:")
with open(file_path, "w") as f:
    f.write("hello ")
    f.write("world\n")
    f.write("some text in one line\nsome text in another line")
    x = "some interesting text"
    f.write("\n" + x)

print("**** Opening in read mode")
with open(file_path, "r") as f:
    print(f.read())

print("**** Opening in write mode:")
with open(file_path, "w") as f:
    f.write("hello ")
    f.write("world")

print("**** Opening in read mode:")
with open(file_path, "r") as f:
    print(f.read())
    
print("**** Opening in append mode:")
with open(file_path, "a") as f:
    f.write("\n")
    x = "some interesting text"
    f.write(x)

print("**** Opening in read mode:")
with open(file_path, "r") as f:
    print(f.read())

After you open a file in write or append mode, you can use the `write()` method to write some text to it. Each call adds its text right after whatever has been written so far.
Note that you must take care of adding new lines yourself, using `"\n"` wherever you need them.
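A related detail is that `write()` accepts only strings, so other values such as numbers have to be converted with `str()` first. The sketch below writes a few made-up numbers, one per line, to a throwaway file; `values`, `out_path` and `written` are illustrative names.

```python
import os
import tempfile

values = [3, 14, 15]  # made-up numbers for illustration
out_path = os.path.join(tempfile.mkdtemp(), "numbers.txt")

with open(out_path, "w") as f:
    for value in values:
        # write() needs a string and never adds a new line on its own
        f.write(str(value) + "\n")

with open(out_path, "r") as f:
    written = f.read()

print(repr(written))
```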

### Exercise

Open the file `"ex_1.txt"`, which contains one number in each line.
Create 2 new files named `"ex_1a.txt"` and `"ex_1b.txt"`, with the first one containing the numbers of the original file that are smaller than 500 and the other containing the ones that are bigger.

Hints:
 - Remember to use strip when reading multiple lines.
 - You will have to convert lines from being strings to integers in order to be able to use comparison operators with them.

### CSV and TSV files

The acronyms **csv** and **tsv** stand for Comma Separated Values and Tab Separated Values.
They identify what is probably the most common file format in data science.
A popular example of such files is what you export from Excel or similar editors.


So, what is a **csv** (or a **tsv**)?
It's a raw text file representing a table.
Since we are dealing with raw text files, it is not convenient to represent this table using a grid, as is done within the Excel program; instead, special characters delimit rows and columns.
Each line of the text file represents a row of the table, i.e. the newline escape character `\n` is the delimiter between rows.
Each row contains multiple values that can be separated by any chosen delimiter. Common choices are commas `,` or tabs `\t`. That's where the names of these file formats come from.

Note that the term **csv** is also used for any generic grid representation: if, according to your convention, columns of values are separated by a semicolon `;` or by the word `SEPARATOR`, this is still a valid CSV file and it can be easily processed by Python.
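To make the idea of row and column delimiters concrete, here is a minimal sketch that rebuilds a table by hand with the `split()` method of strings. The data and variable names are made up. In practice the `csv` module shown below is usually preferable, because real files may contain quoted values that a plain `split()` would break; on the other hand, the `csv` module expects a single-character delimiter, so a word-length separator such as `SEPARATOR` needs an approach like this one.

```python
raw_text = "sample_a,1.5,2.0\nsample_b,0.5,4.0\n"  # made-up data

table = []
for line in raw_text.strip().split("\n"):  # "\n" delimits the rows
    table.append(line.split(","))          # "," delimits the columns

print(table)

# The same idea works with an unusual, multi-character separator.
custom_row = "x1SEPARATORy1SEPARATORz1".split("SEPARATOR")
print(custom_row)
```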


In this directory you can see an example of a **csv**: the file named `data.csv`.

Let's now see how to read **csv** files in Python.

In [None]:
import csv

with open('data.csv', "r") as f:
    x = csv.reader(f, delimiter=',')
    for row in x:
        print(row)

As you can see, first of all it's necessary to `open` the file, exactly as it was done before.

In order to read from the file, it is necessary to create a reader object.
This can be done using the `csv.reader()` function provided by the `csv` module.
Since we are using a module, do not forget `import csv` at the beginning of your code!
The `csv.reader()` function takes as input the opened file and the delimiter used to separate columns. You may notice that the delimiter parameter is passed in a particular way, i.e. by writing `delimiter=` before the value. There is nothing magic about it: this is called a keyword argument, and it serves two purposes. It lets you see at a glance what that second parameter corresponds to, and it tells Python which parameter the value is meant for.
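Passing a value by name in this way is usually called a keyword argument, and it works with functions that you define yourself too. The function `describe()` below is a hypothetical helper written only to illustrate the calling syntax.

```python
def describe(sample, unit):
    # Hypothetical helper, used only to show how arguments can be passed.
    return sample + " measured in " + unit

by_position = describe("sample_a", "mg")
by_keyword = describe("sample_a", unit="mg")
reordered = describe(unit="mg", sample="sample_a")

print(by_position)
print(by_keyword)
print(reordered)
```

All three calls produce the same result: when the names are given explicitly, even the order of the arguments no longer matters.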

What the `csv.reader()` function returns is an iterable object, i.e. you can use a `for` loop on it.
At each iteration of the loop, the loop variable is assigned a different row of the file.
Each row is represented as a list of strings.
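The same function reads **tsv** data if you pass a tab as the delimiter. The sketch below is an illustration under a few assumptions: the data is made up, and `io.StringIO` is used to wrap a string so that it behaves like an opened file, which keeps the example independent of any file on disk. It also shows one way to turn the strings of a row into numbers.

```python
import csv
import io

tsv_text = "sample_a\t1.5\t2.0\nsample_b\t0.5\t4.0\n"  # made-up data
f = io.StringIO(tsv_text)  # behaves like a file opened in read mode

rows = []
for row in csv.reader(f, delimiter="\t"):
    name = row[0]                 # the first element stays a string
    measurements = []
    for value in row[1:]:
        measurements.append(float(value))  # strings become numbers
    rows.append((name, measurements))

print(rows)
```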


More details on the `csv` module can be found in its documentation
https://docs.python.org/3/library/csv.html

### Exercise

Open the `data.csv` file.
In this file the first element of each row represents the name of a sample, while all the other values in the row are results of different measurements on this sample.
Find the sample where the mean of observed values is the highest.

Hint: note that the csv reader gives you every row as a list of strings. The values will have to be converted to numbers, while the sample name will not.

### CSV DictReader

As you may have noticed, the `data.csv` file used above had a significant problem: its columns were not labeled.
This required some prior knowledge about what the values in each column mean.

The first row of a csv file can be used as a header, i.e. it provides labels for all the columns.
This allows you to represent the csv as many dictionaries instead of lists.
Each dictionary represents a row, where the keys are the column labels and the values are the elements of that row.

In order to deal with csv files that have a header, a different function should be used: `csv.DictReader()`.
Its name is capitalized because `DictReader` is a class: calling it constructs an object of type `DictReader`.

In [None]:
import csv

with open('data_with_header.csv', "r") as f:
    x = csv.DictReader(f, delimiter=';')
    for row in x:
        print(row['sample'])
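To see what the dictionaries look like, here is a minimal sketch on made-up data. The column names `sample`, `x`, `y` and `z` are assumptions for illustration, and `io.StringIO` again stands in for an opened file. The `fieldnames` attribute of the reader holds the labels taken from the header row.

```python
import csv
import io

csv_text = "sample;x;y;z\nsample_a;1.5;2.0;3.0\nsample_b;0.5;4.0;1.0\n"  # made-up data
reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")

records = []
for row in reader:
    records.append(dict(row))  # one dictionary per row, header excluded

print(reader.fieldnames)
print(records)
```

Note that the values are still strings, exactly as with `csv.reader()`, so they need to be converted before doing any arithmetic.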

### Exercise

The `data_with_header.csv` file contains the same data as `data.csv`, with the addition of a header as its first row.

Find the sample for which the following expression has the smallest value: `x + z/y`.

Hint: note that the file uses `;` as separator.

### Exercise

Download the `Processed nylon microarray data` from https://bioinformatics.mdanderson.org/Supplements/Datasets/Threeway/index.html (get both the normalized array data and the annotations).

Load both files in Python.

Remove from the normalized array data all the entries for which there is not a corresponding annotation.