# Lecture 04: Input and output

## Chapters
Chapter 6: Storing and manipulating data <br>
Author: Jurre Hageman

## Overview

For this lesson, we will write a small program that implements the command-line program "cat".
cat is a frequently used command on Unix-like operating systems. It can display text files, combine copies of different text files and create new text files.

We will write a small Python program that mimics the text display feature of the cat command. <br>
In addition, we will write a program that reads DNA from a file, generates its reverse complement and writes the output to a text file.

## A Python Cat Clone

cat is a small Unix utility that can display (among other things) text files. cat is written in C and the source code can be found <a href="https://git.savannah.gnu.org/cgit/coreutils.git/plain/src/cat.c">here</a>. <br>
Here you can see cat in action:

In [1]:
!cat input.txt

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Here cat opens a file stream and displays the contents on screen. The file is an ASCII file in FASTA format and contains the sequence of the cytochrome b protein from the elephant.

Unix and Windows use different types of line endings. While Unix-like systems (such as macOS and Linux) use Line Feed (LF), Windows uses Carriage Return and Line Feed (CRLF). You can check the type of line ending using the file command (in a terminal you type it without the ! character):

In [2]:
!file input.txt

input.txt: ASCII text


Now we check another file:

In [3]:
!file input_windows_endings.txt

input_windows_endings.txt: ASCII text, with CRLF line terminators


As you can see, this file has CRLF endings. You can convert between these endings using dos2unix and unix2dos. See: https://sourceforge.net/projects/dos2unix/
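
The same check can be sketched in Python. This is a minimal illustration, not a replacement for the file command: the function name `detect_line_ending` and the two temporary demo files are made up for this example. The file is opened in binary mode ("rb") so that Python does not translate the line endings while reading.

```python
import os
import tempfile


def detect_line_ending(path):
    """Return 'CRLF', 'LF' or 'none' based on the first line of the file."""
    # Binary mode ('rb') returns the raw bytes, including '\r' characters.
    handle = open(path, "rb")
    first_line = handle.readline()
    handle.close()
    if first_line.endswith(b"\r\n"):
        return "CRLF"
    if first_line.endswith(b"\n"):
        return "LF"
    return "none"


# Two small temporary files with hypothetical content for the demonstration.
demo_dir = tempfile.mkdtemp()
unix_path = os.path.join(demo_dir, "unix.txt")
windows_path = os.path.join(demo_dir, "windows.txt")

handle = open(unix_path, "wb")
handle.write(b">seq\nACGT\n")
handle.close()

handle = open(windows_path, "wb")
handle.write(b">seq\r\nACGT\r\n")
handle.close()

print(detect_line_ending(unix_path))     # LF
print(detect_line_ending(windows_path))  # CRLF
```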

Back to the exercise: we will use Python to make a cat clone.

Let's first think of what the program should do:
- it should open a file
- it should loop through every line
- it should print the content of the line to the screen
- it should close the file

## Read file

As a start, we will write some code to print multiple items using a loop.
There are different types of loops. 
The for loop in Python is used to avoid repetition of code.
Suppose we have a collection of different items stored. We can store them in a list:

In [4]:
items = ["This", "is", "a", "list", "with", "each", "item", "being", "a", "string"]
print(items)

['This', 'is', 'a', 'list', 'with', 'each', 'item', 'being', 'a', 'string']


Like a list, a text file is also a collection of items. It is a collection of lines containing characters.
To loop through the collection we can use a for loop. In this way, each item will be printed:

In [5]:
items = ["This", "is", "a", "list", "with", "each", "item", "being", "a", "string"]
for item in items:
    print(item)

This
is
a
list
with
each
item
being
a
string


The variable item is a placeholder that is overwritten in each iteration of the loop. In the first iteration item refers to "This", in the second to "is", and so on.
Note that item (singular) differs from items (plural): items refers to the complete list, while item is the placeholder in the for loop that refers to one element of the list at a time.

The next thing we would like to do is to open a file.
Files can be opened in Python using the built-in open() function.
We have a text file that is named <a href="HIER DE GOEDE LINK">input.txt</a>. Make sure that you download this file into your working directory (the same directory in which you store your Python file).
Using the following code we can create a file object for the opened file:

In [6]:
file_object = open("input.txt")
print(file_object)

<_io.TextIOWrapper name='input.txt' mode='r' encoding='UTF-8'>


Printing the file object gives a somewhat intimidating result. Do not let it intimidate you. It is just a file object in read mode, with a certain encoding, that gives access to the content of the file.
To display the content we can loop through the file object in the following manner:

In [7]:
my_file = open("input.txt")
for line in my_file:
    print(line)
my_file.close()

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]

LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV

EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG

LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL

GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX

IENY


The variable "line" is now a placeholder for every line of the file. In the first iteration it refers to the first line, in the second iteration to the second line, and so on.
Note that an empty line appears between the lines. Each line read from the file already ends with a line break, and the print function adds another one after everything it prints. This can be avoided by using the end='' argument:

In [8]:
mssg1 = "Hello"
mssg2 = "World"
print(mssg1)
print(mssg2)
print(mssg1, end='')
print(mssg2, end='')

Hello
World
HelloWorld
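
To make the invisible line break visible, a line can be printed with repr(). The snippet below is a small sketch that uses a hand-written string in place of a line read from a file. It also shows rstrip("\n"), a string method that removes a trailing line break.

```python
line = "IENY\n"  # a line as it would be read from a text file

print(repr(line))               # 'IENY\n'
print(repr(line.rstrip("\n")))  # 'IENY'
```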

To completely mimic the display feature of cat in Python, the extra line breaks between lines should be avoided:

In [9]:
my_file = open("input.txt")
for line in my_file:
    print(line, end='')
my_file.close()

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

As you can see, the output is exactly the same as the output of cat:

In [10]:
!cat input.txt

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

## Reading your file in streaming mode

You may stumble upon the file.read() method, which is used a lot. It might look convenient at first because you can omit the for loop:

In [11]:
my_file = open('input.txt')
content = my_file.read()
print(content)
my_file.close()

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY


However, there is a major drawback to this approach. The read() method loads the complete file into memory at once, so the memory consumed grows with the size of the file:

In [12]:
import sys
my_file = open('input.txt')
content = my_file.read()  # the complete file as one string
print(sys.getsizeof(content), 'bytes')
my_file.close()

403 bytes


Compare this to the memory used in streaming mode:

In [13]:
my_file = open('input.txt')
for line in my_file:
    content = line
    print(sys.getsizeof(content), 'bytes')  # now you can do something line by line
my_file.close()

115 bytes
120 bytes
120 bytes
120 bytes
120 bytes
53 bytes


From the above code, you can see that the memory used at any one time is much smaller in streaming mode. In each iteration, the line variable is overwritten, so only one line is held in memory. This saves memory, especially with large files! Whenever possible, use streaming mode instead of the file.read() method.
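
As an illustration of doing something useful line by line, the sketch below counts the sequence characters in a FASTA file without ever holding the complete file in memory. The function name `count_residues` and the temporary demo file are assumptions made for this example.

```python
import os
import tempfile


def count_residues(path):
    """Count the sequence characters in a FASTA file, one line at a time."""
    total = 0
    fasta_file = open(path)
    for line in fasta_file:
        # Header lines start with '>' and are not part of the sequence.
        if not line.startswith(">"):
            total += len(line.strip())  # strip() removes the line break
    fasta_file.close()
    return total


# A small temporary file with hypothetical content for the demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
demo_file = open(demo_path, "w")
demo_file.write(">demo\nLCLYTH\nIENY\n")
demo_file.close()

print(count_residues(demo_path))  # 10
```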

## Write file

So far we have opened a file and displayed its content.
You may also want to write something to a file. 
To do so, first we have to open a new output file:

In [14]:
output_file = open("output.txt", "w")
print(output_file)
output_file.close()

<_io.TextIOWrapper name='output.txt' mode='w' encoding='UTF-8'>


We have generated a file object and assigned it to the variable output_file.
The "w" argument informs the Python interpreter that the file object should be opened in write mode.
The default is read mode ("r") which does not need to be specified.
To write something to the file we can use the file.write() method.

In [15]:
output_file = open("output.txt", "w")
output_file.write("This is written to the file" + '\n')
output_file.write("End of message" + '\n')
output_file.close() # Do not forget to close the file. This is very important!
print('Done')

Done


Now we have a file in the same directory as the Python script.
We can open it to check if the content has been written to the file:

In [16]:
!cat output.txt

This is written to the file
End of message


Write mode ('w') overwrites the file content each time you open the file. To add to the existing content instead, open the file in append mode ('a'):

In [17]:
output_file = open("output2.txt", "a")
output_file.write("This is new text" + '\n')
output_file.close()

Read the content:

In [18]:
!cat output2.txt

This is new text
This is another line of text
This is YET another line of text
This is new text
This is another line of text
This is YET another line of text
This is new text


Add a new line:

In [19]:
output_file = open("output2.txt", "a")
output_file.write("This is another line of text" + '\n')
output_file.close()

In [20]:
!cat output2.txt

This is new text
This is another line of text
This is YET another line of text
This is new text
This is another line of text
This is YET another line of text
This is new text
This is another line of text
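
The difference between the two modes can be seen side by side in the following sketch. It writes to a temporary file with a made-up name, so it does not touch output.txt or output2.txt. The read() method is used here only because the demo file is tiny.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "modes.txt")

# 'w' creates the file, or empties it if it already exists.
demo_file = open(demo_path, "w")
demo_file.write("first\n")
demo_file.close()

# 'a' keeps the existing content and adds to the end.
demo_file = open(demo_path, "a")
demo_file.write("second\n")
demo_file.close()

demo_file = open(demo_path)
after_append = demo_file.read()
demo_file.close()

# Opening in 'w' mode again discards both earlier lines.
demo_file = open(demo_path, "w")
demo_file.write("third\n")
demo_file.close()

demo_file = open(demo_path)
after_overwrite = demo_file.read()
demo_file.close()

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_overwrite))  # 'third\n'
```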


## A safer method to open a file object

File objects need to be closed using the file.close() method. While this is not crucial for reading files, it is for writing. If you forget to close your file after write operations, this may cause data loss. Fortunately, Python offers a context management protocol that automates the closing of the file. You use the context management protocol through the with statement, which acts as a wrapper to ensure that your file will always be closed. If you want to know more about the context management protocol (and maybe want to write your own context management class) see chapter 9 of your book. For now we limit ourselves to using the with statement, without explaining the internal details of how it works:

In [21]:
with open('output2.txt', 'a') as output_file:
    output_file.write("This is YET another line of text" + '\n')

In [22]:
!cat output2.txt

This is new text
This is another line of text
This is YET another line of text
This is new text
This is another line of text
This is YET another line of text
This is new text
This is another line of text
This is YET another line of text


Note that there is no file.close() statement. But can we still write to the file?

In [23]:
output_file.write("ok")

ValueError: I/O operation on closed file.

As you can see, the last write operation failed because the file object was already closed. All thanks to the with statement. It is always a good idea to use the with statement when using files. 
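
The with statement also closes the file when something goes wrong inside the block. The sketch below raises an error on purpose to show this. The temporary file name and the error message are made up for this example.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

try:
    with open(demo_path, "w") as demo_file:
        demo_file.write("partial result\n")
        # Simulate a problem halfway through the write operations.
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass

# The file was closed when the with block was left, despite the error.
print(demo_file.closed)  # True

with open(demo_path) as demo_file:
    saved_text = demo_file.read()
print(repr(saved_text))  # 'partial result\n'
```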

## Exercise: Read a sequence from a file and write the reverse, complement and reverse complement to a file.

Now we come to the final exercise:
The file <a href="hier correcte link">dna.fasta</a> is a FASTA file. The FASTA format is a text-based format for representing nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
This is an example: <br>
\>My_dna_sequence <br>
atcaggatggggatggagagaggaccaaccac <br>
acagagtagagagagaggagagagacaagata <br>
tatatttttatacccaggagagacagatagag <br>
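
The structure of such a file (one header line starting with ">" followed by sequence lines) can be read as in the following sketch. It assumes a file with a single record. The function name `read_fasta` and the temporary demo file are made up for this example.

```python
import os
import tempfile


def read_fasta(path):
    """Return (header, sequence) for a FASTA file with a single record."""
    header = ""
    sequence = ""
    with open(path) as fasta_file:
        for line in fasta_file:
            line = line.strip()  # remove the line break
            if line.startswith(">"):
                header = line[1:]  # drop the leading '>'
            else:
                sequence += line   # join the sequence lines together
    return header, sequence


# A small temporary file with hypothetical content for the demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(demo_path, "w") as demo_file:
    demo_file.write(">My_dna_sequence\natcagg\nacagag\n")

header, sequence = read_fasta(demo_path)
print(header)    # My_dna_sequence
print(sequence)  # atcaggacagag
```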

To open the file one could use the following code:

In [None]:
open_file = open("dna.fasta")

We can check if the file object is created:

In [None]:
print(open_file)

The file object is created, but this is not a very flexible way to open files.
With this code, the file MUST be named "dna.fasta".
This is a nightmare for the user of the program. A much more flexible way is to use command-line arguments.
The program can then be called with the name of the file as an argument.
This is exactly what cat does:

In [None]:
!cat dna.fasta

Here dna.fasta is a command-line argument for the cat program.

In this way, the user can display the content of a file with any name.
To use command-line arguments we need a specific Python module: `sys`.
`sys` gives access to the command-line arguments through its `argv` attribute. `argv` is a list of strings containing the command-line arguments.
Import this module using the following code:

In [None]:
import sys

The sys.argv attribute is a list of command-line arguments. All list items are strings. The first item of the list is the name of the Python script itself. The second item is your first command-line argument. You can catch all arguments as follows:

In [24]:
args = sys.argv # all items in a list

Now we can catch a single command-line argument with sys.argv as follows:

In [None]:
file_name = sys.argv[1] #second item from the list

The variable file_name now refers to the first command-line argument (which will be the file name).
Use IDLE3 to write a program that:
- catches the file names from the command line
- reads the content of the file line by line
- generates the reverse complement of the DNA string
- writes the reverse-complement DNA to a user-defined output file
- is organised in functions. For simplicity, I did not use functions in this tutorial, but you should use functions for your script!

Note that you cannot simply run the code via IDLE3, since the program expects command-line arguments. You have to save the file first and then call your Python file with its arguments from the command line. For example:

>rev_comp_dna.py dna.fasta dna_rev_comp.fasta
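
A script that is called like this can check that it received the expected number of arguments before it opens any file. The following is a minimal sketch of such a check. The function name `parse_arguments` and the usage message are assumptions made for this example, and the hand-made list stands in for sys.argv.

```python
import sys


def parse_arguments(argv):
    """Return (input_name, output_name) from a sys.argv-style list."""
    # argv[0] is the script name, so two file names give a length of 3.
    if len(argv) != 3:
        # sys.exit() with a string prints the message and stops the program.
        sys.exit("Usage: rev_comp_dna.py <input_file> <output_file>")
    return argv[1], argv[2]


# Example with a hand-made list; a real script would pass sys.argv instead.
example_argv = ["rev_comp_dna.py", "dna.fasta", "dna_rev_comp.fasta"]
input_name, output_name = parse_arguments(example_argv)
print(input_name)   # dna.fasta
print(output_name)  # dna_rev_comp.fasta
```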

## Solutions

Solutions for the exercises are given below. Programming is like playing the piano: exercise, exercise, exercise. You learn most from typing every single word yourself. If you have no clue what to do you can have a look, but only after your first and second try!

<p><a href="Here the solution">rev_comp_dna.py</a></p>

