<div style="text-align: right">
    <i>
        LING 5981/6080: Fundamentals of Python <br>
        Fall 2020 <br>
        Aniello De Santo
    </i>
</div>

# Notebook 5: File IO

This notebook shows how to read data from and write data to external files such as `.txt` or `.csv`, both locally (i.e. if you are running notebooks directly on your laptop) and through Colab. It refers to some concepts (e.g. dictionaries) that we haven't discussed yet, so keep it as a reference guide for future projects in which we will need to import data from external files.

If you are working with this notebook in Colab, run the next cell.

In [1]:
from google.colab import files

ModuleNotFoundError: No module named 'google'

## \[On Colab\] Opening files

Working with files in Colab is very different from working with files locally.

The `files.upload()` function invokes an interactive window and asks you to upload the file that you want to process. Choose the `grades.csv` file for now. The contents of the file are then saved in the variable `uploaded`.

In [None]:
uploaded = files.upload()

The format of this file is `.csv`, and it is a simple way to store tables. CSV stands for "comma separated values", and indeed, these files look like this:

    Name,Last Name,Department,Points
    Matt,Bellamy,AMS,79
    Dominic,Howard,LIN,82
    Chris,Wolstenholme,CSE,72

If we are working through Colab, the uploaded file is stored in memory as a dictionary, where the key is the filename and the value is the contents of the file.

In [None]:
print(type(uploaded))
print(uploaded)

Notice the `b` at the beginning of every value. This `b` stands for _bytes_, and in fact, the type of the value is not a string:

In [None]:
type(uploaded['grades.csv'])

Sequences of bytes are machine-readable, whereas sequences of symbols are human-readable. _Strings are **encoded** as sequences of bytes, and in order to get a string from that sequence of bytes, we need to **decode** it._ Read more [here](https://www.geeksforgeeks.org/byte-objects-vs-string-python/) if you are interested.

In [None]:
uploaded['grades.csv'] = uploaded['grades.csv'].decode('utf-8')
print(type(uploaded['grades.csv']))
print(uploaded['grades.csv'])

**UTF-8** (Unicode Transformation Format, 8-bit) is a format that encodes every Unicode character as a sequence of one to four bytes; the $8$ in the name refers to the 8-bit units it is built from. **Unicode** is a universal standard for encoding symbols of nearly all human languages, see [the chart](https://unicode.org/charts/).
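
To make the variable width of UTF-8 concrete, here is a minimal sketch that encodes a few characters and counts the resulting bytes. The sample characters and the sample string are chosen only for illustration; they are not taken from the uploaded files.

```python
# Sample characters chosen for illustration only.
samples = ["a", "é", "€", "😀"]

# Number of bytes each character occupies once encoded as UTF-8.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_counts)

# Encoding and then decoding with the same format gives back the original string.
text = "Matt,Bellamy,AMS,79"
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)
```

Plain English letters take one byte each, which is why the bytes of `grades.csv` look almost identical to the decoded string.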

The `decode('utf-8')` call on `uploaded['grades.csv']` converts that sequence of bytes to a human-readable string. Now, the file is represented as a single string and can be split into lines at the new-line character.

In [None]:
file_lines = uploaded['grades.csv'].strip().split("\n")
print(file_lines)

**Practice.** Extract values from the column `Department`.

    Expected output: ['AMS', 'LIN', 'CSE']

**Practice.**  Create a dictionary that will have students' names as keys, and their grades as values.

    Expected output: {'Matt': '79', 'Dominic': '82', 'Chris': '72'}

Now, let's invoke the file upload window once again and upload both files: `grades.csv` and `novartis_microsoft.txt`.

In [None]:
uploaded = files.upload()

The dictionary `uploaded` now has two keys: `grades.csv` and `novartis_microsoft.txt`.

In [None]:
print(uploaded.keys())

The following code then will split the text of the file `novartis_microsoft.txt` into different lines, i.e. it will create a list of strings, where every string is a line of the original file. It will then print all lines of the file:

In [None]:
nov_mic = uploaded['novartis_microsoft.txt'].decode('utf-8').strip().split("\n")
for line in nov_mic:
  print(line)

## \[On Colab\] Working with csv files with `pandas`

Additional functionality comes from the package `pandas`. It has a wide variety of uses, and one of them is an easy way to extract a column from a csv file. Here, we see a new way to import a package:

    import pandas as pd

It means that you are importing `pandas`, but instead of the full name, you are going to refer to the package as `pd` further in the code.

In [None]:
import pandas as pd
import io

We also import the module `io` to gain access to the `io.BytesIO` class, which wraps the sequence of bytes representing the file (a **byte stream**) in a file-like object that `pandas` can load into a _dataframe_. **Dataframe** is just the name `pandas` uses for a table.

In [None]:
df = pd.read_csv(io.BytesIO(uploaded['grades.csv']))

The function `pd.read_csv` can take a byte stream created by `io.BytesIO` from the file `grades.csv` as its argument.
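
To see what a byte stream looks like on its own, the sketch below wraps a short, hand-written bytes value in `io.BytesIO` and reads from it as if it were a file. The variable `raw` is a hypothetical stand-in for `uploaded['grades.csv']`, so the example runs outside of Colab too.

```python
import io

# Hypothetical stand-in for uploaded['grades.csv']: the raw bytes of a small CSV.
raw = b"Name,Last Name,Department,Points\nMatt,Bellamy,AMS,79\n"

stream = io.BytesIO(raw)      # wrap the bytes in a file-like object
header = stream.readline()    # read up to and including the first new line
rest = stream.read()          # read whatever is left

print(header)
print(rest)
```

Both results are still `bytes`, not strings, which is why `pd.read_csv` has to decode them for us.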

`pandas` automatically extracts column names from the `csv` document, and the columns can be accessed simply in the following format:
  
    dataframe[column_name]

In [None]:
print("pandas representation:")
print(df["Department"])

print("\nList representation:", list(df["Department"]))

A row can be accessed by its index; in this case, the syntax is the following:

    dataframe.iloc[index]

In [None]:
print("pandas representation:")
print(df.iloc[1])

print("\nList representation:", list(df.iloc[1]))

Read more about data processing with pandas [here](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c).

## \[On Colab\] Creating files

The generic template for working with files outside of Colab is the following:
 
    with open(path_to_file, mode) as name_of_open_file:
        # code where the open file is referred to as name_of_open_file
        
In Colab, it can be used to create files: `path_to_file` will simply contain the name and the extension of the file that we are creating, and `mode` is the mode of working with the file. If the mode is `w` (i.e. writing), a new file is created. The name `name_of_open_file` is how we refer to the file-writing object we initialized. This object is also known as a **Text IO Wrapper** (TextIOWrapper).

There are two ways to write the information to the file via the TextIOWrapper (let's call it `f`):
  * `f.write(string)` takes a string as an argument and writes it to the file;
  * `f.writelines(list_of_strings)` takes a list of strings as an argument and writes every single one of those lines to the file.

In [None]:
with open('example.txt', 'w') as f:
  f.write('Hello world!')

To download the file we created, use the following code:

    files.download(name_of_the_file)
    
It is a function from the `google.colab.files` module that we imported in the first cell of this notebook.

In [None]:
files.download('example.txt')

**Warning:** you might get the "Failed to fetch" error message if you run the cell above. If so, please follow the advice from [this page](https://stackoverflow.com/questions/53581023/google-colab-file-download-failed-to-fetch-error).

While `write` writes a single string to the file, `writelines` takes a list of strings as its argument and writes every string from that list to the file.

In [None]:
with open('another_example.txt', 'w') as f:
  f.writelines(['Hello world!', 'How are you, world?', 'Goodbye world!'])
files.download('another_example.txt')

**Practice.** As you can see, the `writelines` method doesn't make every string from that list start on a new line. How can we create a file that will contain the following $3$ lines?

    Hello world!
    How are you, world?
    Goodbye world!

The rest of this notebook demonstrates working with files outside of the Colab environment, so if you are working in Colab, the remaining cells won't run.

## \[Locally\] Opening files

There are two files in the folder `files`: `novartis_microsoft.txt` and `grades.csv`. To open or create a file, we will use the following syntax:

    with open(path_to_file, mode) as name_of_open_file:
        # code where the open file is referred to as name_of_open_file
        
`path_to_file` is a string that points to the file that we want to open or create. The current notebook is in the `notebooks` repository, and therefore in order to give the address of, for example, `novartis_microsoft.txt`, we need to provide the following path: `'files/novartis_microsoft.txt'`.

`mode` is a string that defines the mode in which you are going to work with the file. The main modes are the following:
  * `'r'` (read): in this case we expect the file with the indicated name to already exist, and we are going to read the file line-by-line, where lines are separated from each other by a new line character;
  * `'w'` (write): opening a file in the writing mode will _create_ that file on the computer (or erase its contents if it already exists) and will allow us to write strings into that file;
  * `'a'` (append): opens an already existing file and allows us to add new lines to the end of that file.
  
There are many other modes in which it is possible to open a file, but you can read about them on your own [here](https://stackabuse.com/file-handling-in-python/).
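
The three modes listed above can be tried out together in a small sketch. It works in a temporary folder so that nothing in the `files` folder is touched; the file name `modes_demo.txt` is made up for this example.

```python
import os
import tempfile

# A throwaway folder, so the example does not touch the `files` folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "modes_demo.txt")  # hypothetical file name

    with open(path, "w") as f:   # 'w' creates the file (or empties an existing one)
        f.write("first line\n")

    with open(path, "a") as f:   # 'a' adds to the end without erasing anything
        f.write("second line\n")

    with open(path, "r") as f:   # 'r' reads what is there
        contents = f.read()

print(contents)
```

If the second `open` used `'w'` instead of `'a'`, only `second line` would remain in the file.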

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    for line in file:
        print(line)

The variable `file` is a name for the .txt file when it is loaded in the memory and ready to be processed. Its type is `<class '_io.TextIOWrapper'>` and it is an iterable that contains ordered strings.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    print("Type of `file`:", type(file), "\n")
    for line in file:
        print(type(line))
        print(line)

Every line in a text file ends with a new line character `\n` -- this is how we know when a new line starts! However, if we want to avoid printing an extra new line every time we display a line, we can use the string method `strip`.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    for line in file:
        print(line.strip(), end = " ")
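
Note that `strip` removes all whitespace on both sides of the string, not just the final `\n`. The sketch below contrasts it with `rstrip("\n")` on a made-up line (not taken from `novartis_microsoft.txt`).

```python
# A made-up line with leading spaces and a trailing new line character.
line = "   Novartis and Microsoft\n"

stripped = line.strip()           # removes whitespace on both sides
no_newline = line.rstrip("\n")    # removes only the trailing new line

print(repr(stripped))
print(repr(no_newline))
```

When the indentation of a line matters, `rstrip("\n")` is the safer choice.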

If instead of iterating through the lines of the file you want access to all of them at once, you can read all the lines into a variable by using the `readlines` method: it creates a list of strings, where every string is a separate line of the file.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    lines = file.readlines()
    print(type(lines))
    print(lines)

Another way to avoid explicit iteration and to get lines one by one is to read them into memory one after another by using the `readline` method.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    line = file.readline()
    print(line)
    line = file.readline()
    print(line)
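
When there are no lines left, `readline` returns an empty string, which can be used to detect the end of the file. The sketch below uses `io.StringIO`, which behaves like a text file opened for reading, with two made-up lines, so it runs without any external file.

```python
import io

# io.StringIO behaves like a text file opened for reading; the two lines are made up.
fake_file = io.StringIO("first line\nsecond line\n")

collected = []
while True:
    line = fake_file.readline()
    if line == "":      # an empty string means there is nothing left to read
        break
    collected.append(line.strip())

print(collected)
```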

Notice that every time you execute `readline`, it moves to the next line of the file. To go back, we need the `seek` method, which jumps to the indicated byte of the file; using `seek(0)` therefore moves us back to the very beginning of the file.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    line = file.readline()
    print(line)
    line = file.readline()
    print(line)
    file.seek(0)
    line = file.readline()
    print(line)

If you are using the `with open(filepath, mode)` syntax, the file stays open only while the indented code is being executed. As soon as the `with` code block finishes, the file is closed automatically: the variable `file` still exists, but reading from it raises an error.

In [None]:
with open('files/novartis_microsoft.txt', 'r') as file:
    line = file.readline().strip()
    print(line)
    
line = file.readline()
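
The same behavior can be observed without stopping the notebook on an error. In the sketch below, `io.StringIO` stands in for a real file (an assumption made so that the code runs anywhere), and the error is caught with `try`/`except`.

```python
import io

# io.StringIO stands in for a real file so the sketch runs anywhere.
with io.StringIO("only line\n") as file:
    first = file.readline().strip()

print(file.closed)   # the name `file` still exists, but the object is closed

try:
    file.readline()
    failed = False
except ValueError as error:
    failed = True
    print("Reading failed:", error)
```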

Another way to open the file and keep it open _until explicitly closed_ is to assign the result of `open(filepath, mode)` to a variable. Then, after the file has been processed, it needs to be closed using the `close` method.

    file = open(filepath, mode)
    # code 
    file.close()
    
**Warning:** if the file is open in the `w` mode, i.e. if the file is being created, failure to close the file can result in losing some or all of the information that we intended to write to that file, because it may still be waiting in a buffer. In other modes, leaving files open can cause problems as well.

In [None]:
file = open('files/novartis_microsoft.txt', 'r')
lines = file.readlines()
print(lines[0])
file.close()

Even though the file is closed, the variable `lines` is still active: `readlines` loaded all the lines from the file into `lines` before we closed the file.

In [None]:
print(lines[2].strip())

## \[Locally\] Writing files

As I mentioned before, the mode `w` opens a file in the writing mode, i.e. creates the file. Recall the two methods we used in the reading mode:

* `readline` reads a line and returns a _string_ containing that line;
* `readlines` reads all lines and returns a _list of strings_.

In the writing mode, there are methods that write a string or several strings in a similar manner:

* `write` takes a _string_ as its argument and writes it to the newly created file;
* `writelines` takes a _list of strings_ as its argument and writes all of them to the newly created file.

In [None]:
file = open('files/newfile.txt', 'w')
text_to_write = ["Hello world!", "It is Wednesday.", "Middle of the week!"]
file.writelines(text_to_write)
file.close()

**Warning:** `writelines` accepts only lists of strings. If the data that needs to be written contains other data types, make sure to convert them to strings! The next cell raises an error because of the integer `42`.

In [None]:
file = open('files/newfile.txt', 'w')
text_to_write = ["Hello world!", 42]
file.writelines(text_to_write)
file.close()

The usual `str` function takes care of converting nearly any datatype to its string representation.

In [None]:
file = open('files/newfile.txt', 'w')
text_to_write = ["Hello world! ", str(42)]
file.writelines(text_to_write)
file.close()
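
When a list mixes several data types, the conversion can be applied to every item at once. The sketch below is one possible way to do it, using made-up data and a temporary folder instead of the `files` folder; the file name `mixed.txt` is hypothetical.

```python
import os
import tempfile

# Made-up mixed data: a string, an integer and a float.
items = ["Hello world!", 42, 3.5]

# Convert every item to a string and glue the results together with ", ".
text_to_write = ", ".join(str(item) for item in items)

# Write to a throwaway folder rather than the `files` folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "mixed.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write(text_to_write)
    with open(path, "r") as f:
        written = f.read()

print(written)
```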

## \[Locally\] Working with CSV files

It is in fact possible to engineer a way to work with csv files using the same methods we already discussed.

In [None]:
with open('files/grades.csv', 'r') as file:
    for line in file:
        print(line.strip())

Every line of the file is still a string, and therefore to represent them as a list of values, we will need to split them.

In [None]:
with open('files/grades.csv', 'r') as file:
    for line in file:
        print(line.strip().split(","))

A simpler way to read csv files in Python is to use the `csv` or `pandas` packages.

### \[Locally\] Working with csv through `csv` package

In [None]:
import csv

In order to read a csv file using the `csv` package, right after opening the file, we need to define a `csv.reader` for it. It will parse the rows automatically!

In [None]:
with open('files/grades.csv', 'r') as file:
    csvreader = csv.reader(file)
    for row in csvreader:
        print(row)
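
One reason to prefer `csv.reader` over splitting on commas by hand is that values in a csv file may themselves contain commas, in which case they are wrapped in quotes. The sketch below compares the two approaches on a made-up table held in memory with `io.StringIO`; the data is not from `grades.csv`.

```python
import csv
import io

# Made-up table in which one field contains a comma and is therefore quoted.
data = 'Name,Note\nMatt,"sings, plays guitar"\n'

# Splitting by hand treats every comma as a separator.
naive_rows = [line.split(",") for line in data.strip().split("\n")]

# csv.reader understands the quotes.
csv_rows = list(csv.reader(io.StringIO(data)))

print(naive_rows[1])   # the quoted field is cut in two
print(csv_rows[1])     # the quoted field stays in one piece
```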

Similarly, to write files, we want to define a `csv.writer` and change the mode to `w`. Then we will be able to write rows of the csv one-by-one by applying the `writerow` method to the `csv.writer` object.

In [None]:
# newline='' lets the csv module control line endings (recommended in the csv docs)
with open('files/greetings.csv', 'w', newline='') as file:
    csvwriter = csv.writer(file)
    csvwriter.writerow(["hello", "hi", "howdy"])
    csvwriter.writerow(["zdravstvujte", "privet", "hej"])

You can read more about the functionality of the `csv` package [here](https://docs.python.org/3/library/csv.html).

However, frequently we want to extract the values from a particular _column_ and this might be slightly trickier than extracting a row.
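
Staying within the `csv` package, one possible approach is `csv.DictReader`, which turns every row into a dictionary keyed by the column names from the first row. The sketch below copies the sample table from the beginning of this notebook into a string, so that it does not depend on `files/grades.csv` being present.

```python
import csv
import io

# The sample table from the beginning of the notebook, held in memory
# so that the sketch does not depend on files/grades.csv being present.
data = """Name,Last Name,Department,Points
Matt,Bellamy,AMS,79
Dominic,Howard,LIN,82
Chris,Wolstenholme,CSE,72
"""

reader = csv.DictReader(io.StringIO(data))        # first row becomes the keys
points = [int(row["Points"]) for row in reader]   # values are read as strings

print(points)
```

With a real file, `io.StringIO(data)` would be replaced by the file object returned by `open`.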

### \[Locally\] Working with csv through `pandas` package

In [None]:
import pandas as pd

We can then use `pd.read_csv(filepath)` to import the csv file. After that, the columns can simply be referred to by their names!

In [None]:
grades = pd.read_csv('files/grades.csv')
grades["Name"]