# File I/O
<span style='color:#5A5A5A'> February <mark style="background-color: #FFFF00">28</mark>, 2021 </span>

Last time we covered data structures (lists, tuples, dictionaries, sets) in Python that allow us to work with more powerful data items than just the individual numbers, strings and Booleans that we had used before. We also discussed the important difference between call by value and call by reference.

Until now the course dealt with the basics of imperative programming in Python, and you have learned the most important concepts that you need as a programmer. We will now leave the relatively secluded, controlled environment that we have been in so far and look at how to read and write data from and to files, access online resources, and use external libraries. Connected to that, we will look at how to make programs more robust against errors that come from “the outside”.

Today we will cover how to read and write files in general, how to deal with CSV files in particular, and how to handle runtime errors that can for example be caused by user inputs or file operations.
Next time we will have a closer look at some of the most popular Python libraries for data science applications, such as pandas and matplotlib.

<h3 style='color:#3981CB'> Reading and Writing Files </h3>

Python distinguishes between only two types of files: text and binary. Basically, anything that is not a text file is regarded as a binary file. Text files are sequences of lines, which are themselves sequences of characters that are terminated with a special end-of-line (EOL) character, often the newline character. The content of text files can be processed with the common string manipulation functionality, while processing binary files requires knowledge about their structure. For the moment we are only concerned with text files.

To open a file, first a file object needs to be created with the ```open()``` function:

```
<file_object> = open(<filename>, <mode>)
```

```<filename>``` is the name (path) of the file to open, and ```<mode>``` specifies the kind of processing for which the file is opened: "r" for reading content, "w" for writing content (an existing file is overwritten), "a" for appending content, or "r+" for a combined read and write mode. For example:

In [None]:
# creating a file object in reading mode
file = open("shorttext.txt", "r")
print(file)
file.close()
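
To make the difference between the modes more concrete, here is a minimal sketch. The temporary directory and the file name `modes_demo.txt` are assumptions chosen for illustration, so that no existing file is touched:

```python
import os
import tempfile

# Work in a fresh temporary directory (an assumption for this sketch)
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

file = open(path, "w")   # "w" creates the file, or empties it if it exists
file.write("first line\n")
file.close()

file = open(path, "a")   # "a" keeps the existing content and adds to the end
file.write("second line\n")
file.close()

file = open(path, "r")   # "r" only allows reading
modes_content = file.read()
file.close()

print(modes_content)
```

Opening the same file with "w" once more would discard both lines, which is why "a" is the safer choice when existing content should be kept.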

Once the file has been opened, operations according to the chosen mode can be carried out, for example:

In [None]:
# file.read() to read all characters in the file
content = file.read()


# file.read(n) to read the first/next n characters of the file
first_n = file.read(10)


# file.readline() to read a (the first/next) line of the file
first_line = file.readline()


# file.readlines() to read the (remaining) lines of the file into a list
lines = file.readlines()

# file.write to write (or append) text to a file
file.write("Hello World!\n")
file.write("It's cold today...\n")
file.writelines(["Another line\n", "and another line\n"]) 

In [None]:
# creating a file object in reading mode
file = open("shorttext.txt", "r")

# file.read() to read all characters in the file
content = file.read()
print(content)

# file.read(n) to read the first/next n characters of the file
first_n = file.read(10)
print(first_n)

# file.readline() to read a (the first/next) line of the file
first_line = file.readline()
print(first_line)

# file.readlines() to read the (remaining) lines of the file into a list
lines = file.readlines()
print(lines)
file.close()

In [None]:
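# file.write to write (or append) text to a file
# (the file has to be opened in "w" or "a" mode first)
file = open("newtext.txt", "w")
file.write("Hello World!\n")
file.write("It's cold today...\n")
file.writelines(["Another line\n", "and another line\n"])
file.close()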

Play with the above code and a small text file of your choice to see what happens. Add printouts to visualize what has been read by the different commands.
When all operations on the file have been performed, the file should be closed again to avoid unintended side effects:

```
file.close()
```
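
One thing worth noticing when experimenting is that a file object remembers how far it has been read, so every read operation continues where the previous one stopped. The following sketch illustrates this with a small file created just for the purpose (the file name and its content are made up). The `seek(0)` call, which is not used elsewhere in this lecture, moves the position back to the beginning of the file:

```python
import os
import tempfile

# Create a small demo file (hypothetical name and content)
path = os.path.join(tempfile.mkdtemp(), "position_demo.txt")
file = open(path, "w")
file.write("line one\nline two\nline three\n")
file.close()

file = open(path, "r")
first_n = file.read(4)           # the first 4 characters
rest_of_line = file.readline()   # the rest of the first line
remaining = file.readlines()     # all lines that have not been read yet
nothing_left = file.read()       # empty string: the end has been reached
file.seek(0)                     # jump back to the start of the file
everything = file.read()         # now the whole content is read again
file.close()

print(repr(first_n), repr(rest_of_line), remaining, repr(nothing_left))
```

This also explains why a `read()` that follows another `read()` on the same file object returns an empty string.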

With the ```with``` statement, Python provides an alternative, elegant way to handle files. It also takes care of closing the file when the block ends, so it is a good idea to make it a habit to use it for file handling (and thus never forget to close a file):

In [None]:
with open("shorttext.txt", "r") as file:
    content = file.read()
    
with open("newtext.txt", "w") as file:
    file.write("Hello World!\n")
    file.write("It's cold today...\n")
    file.writelines(["Another line\n", "and another line\n"])

Note that there is also a short and elegant way to iterate over all lines of a file, without explicitly calling ```readlines()``` first:

```
for line in file:
    <do something with line>
```    
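
A small runnable version of this pattern could look as follows. The file name `lines_demo.txt` and its content are assumptions for illustration only:

```python
import os
import tempfile

# Create a small demo file (hypothetical name and content)
path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w") as file:
    file.writelines(["short\n", "a longer line\n", "end\n"])

stripped_lines = []
with open(path, "r") as file:
    for line in file:
        # each line still ends with its newline character, so strip it
        stripped_lines.append(line.rstrip("\n"))

print(stripped_lines)
```

Iterating this way reads one line at a time, so the complete file does not have to be held in memory at once.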
    
As a more complete example, see the following code to read the text from a file, encrypt it using the Caesar cipher, and write it into another file:

In [None]:
from caesarcipher import CaesarCipher

with open("shorttext.txt", "r") as file:
    content = file.read()
    
content_encrypted = CaesarCipher(content, offset=3)

with open("shorttext_encrypted.txt", "w") as file:
    file.write(content_encrypted.encoded)
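
To see what the encryption step does to the text, the shifting can also be sketched by hand. The function name `caesar_shift` is hypothetical, the sketch only shifts the letters A-Z and a-z, and it is not meant to reproduce the exact behavior of the caesarcipher package:

```python
def caesar_shift(text, offset):
    """Shift every ASCII letter in text by offset positions in the alphabet."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            # wrap around within the 26 letters, keeping upper/lower case
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + offset) % 26 + base))
        else:
            # leave digits, spaces and punctuation unchanged
            result.append(ch)
    return "".join(result)


encrypted = caesar_shift("Hello, World!", 3)
decrypted = caesar_shift(encrypted, -3)
print(encrypted)
print(decrypted)
```

Shifting by the negative offset undoes the encryption, which is a quick way to check that the function behaves as intended.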

This code produces no output on the command line, but if you try it with a text file yourself, you will see the effect in the new file that is created.

<h3 style='color:#3981CB'> Dealing With CSV Files </h3>

Let's look at another kind of text file that you will frequently come across when working on data science problems: CSV files. CSV stands for "comma-separated values" and means that commas are used to separate the values in a line from each other. Sometimes other characters are used as separators, such as the tab character "\t" or the semicolon ";", so don't be confused if you see that. As such, CSV files are a simple means to represent tabular data. The following example is based on the Dutch municipalities data set from Kaggle (https://www.kaggle.com/justinboon/municipalities-of-the-netherlands/data), stored in the file dutch_municipalities.csv. We can open and read this file as in the examples above:

In [None]:
with open("dutch_municipalities.csv", "r") as csvfile:
    print(csvfile.read())

In this form (as one long string) the content of the CSV file is of course not of much use, as it is difficult to access individual elements in it. Instead of reading the file content in one go, we could read it line by line (getting a list of lines), and then split the lines at the separator to create a list or dictionary of the elements in each row of the table, resulting in a big list of lists or list of dictionaries. Luckily, however, CSV files are so common that there is a module called csv in the Python standard library that provides this and other frequently needed functionality for working with CSV files (please refer to the online documentation at https://docs.python.org/3/library/csv.html for full reference). Here are some examples of what working with the module can look like:
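
For comparison, the manual approach described above (splitting lines at the separator) might be sketched like this. The miniature table is made up; only the column names follow the lecture's file, and the tab separator is an assumption that matches the examples in this section:

```python
# Made-up miniature table in a tab-separated layout (values are invented)
raw_text = (
    "municipality\tprovince\tmurders_2014\n"
    "Town A\tUtrecht\t2\n"
    "Town B\tDrenthe\t0\n"
)

lines = raw_text.splitlines()      # one string per row
header = lines[0].split("\t")      # column names from the first row

table = []
for line in lines[1:]:
    values = line.split("\t")
    # pair every column name with the value in the same position
    table.append(dict(zip(header, values)))

print(table[0]["municipality"], table[0]["murders_2014"])
```

Note that all values remain strings, and that this simple splitting breaks as soon as a field itself contains the separator or is wrapped in quotes. Handling such cases is one of the things the csv module takes care of.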

In [None]:
import csv

# csv.reader returns a reader object that yields the rows of the
# file one by one, each as a list of strings
with open("dutch_municipalities.csv", "r") as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    for row in csvreader:
        print(row[0])

In [None]:
# csv.DictReader yields the rows of the file one by one as
# dictionaries, using the first row of the CSV file as keys
with open("dutch_municipalities.csv", "r") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter='\t')
    for row in csvreader:
        print(f'{row["municipality"]}:\t {row["murders_2014"]}')

In [None]:
# same as the previous example, but printing only murder numbers
# if at least one murder happened
with open("dutch_municipalities.csv", "r") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter='\t')
    for row in csvreader:
        if int(row["murders_2014"]) > 0:
            print(f'{row["municipality"]}:\t {row["murders_2014"]}')
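
The csv module can write files as well. The following sketch writes a small made-up table with `csv.writer` and reads it back with `csv.reader`. The file name `csv_demo.csv` and the values are assumptions for illustration. Opening the file with `newline=""` follows the recommendation in the csv documentation:

```python
import csv
import os
import tempfile

# Made-up rows; the first one serves as header
rows = [
    ["municipality", "murders_2014"],
    ["Town A", "2"],
    ["Town B", "0"],
]

path = os.path.join(tempfile.mkdtemp(), "csv_demo.csv")  # hypothetical file

# write all rows, separated by tabs as in the examples above
with open(path, "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile, delimiter="\t")
    csvwriter.writerows(rows)

# read the file back into a list of lists of strings
with open(path, "r", newline="") as csvfile:
    read_back = list(csv.reader(csvfile, delimiter="\t"))

print(read_back)
```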

If you want to do more advanced things with the data from CSV files, for example merge, join, or concatenate tables from different CSV files, you can absolutely do that with CSV files read in as above and the knowledge about loops, conditions, lists, dictionaries etc. that you have, but it can be a bit tricky. This is why, when such operations are (likely to be) needed, it is usually recommended to use the pandas library (http://pandas.pydata.org/), which has specialized functions for this.

Pandas has its own function for reading CSV files, which returns the result as a so-called data frame, as shown in the following example:

In [None]:
import pandas as pd

df = pd.read_csv('dutch_municipalities.csv', sep="\t")
print(df)

Data frames are two-dimensional labeled data structures, very much like tables. The rows are labeled by an index (typically ascending from 0), and the columns are labeled by the column names, corresponding to the kind of data that is contained in them. See https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe for further details.

Data frames have a number of attributes, such as the column labels, the row indices and the types of the data in the columns (see a full list at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), that can be accessed as illustrated below:

In [None]:
print(df.index)
print(df.columns)
print(df.dtypes)

Via the ```iloc``` attribute we can access a row by its integer position (which coincides with the index label as long as the default index is used), for example:

In [None]:
print(df.iloc[39])
print(type(df.iloc[39]))

As the output shows, a single row of a data frame is of type “Series” (see https://pandas.pydata.org/pandas-docs/stable/reference/series.html for full reference), which basically means a one-dimensional labeled data structure. Series are iterable. You may have already noticed that many functions in, e.g., pandas and matplotlib take Series as input, and this is one way to get them.

Slicing works with ```iloc```, too, so a range of indices can be used to access several rows at a time. The result is of type “DataFrame” again:

In [None]:
print(df.iloc[39:42])
print(type(df.iloc[39:42]))

Similarly, a list of indices (not necessarily a range) can be used:

In [None]:
print(df.iloc[[38,40,42]]) 
print(type(df.iloc[[38,40,42]]))

The ```iloc``` access can also be used for indexing at both axes of the data frame, including accessing a single element (note the different resulting data types):

In [None]:
print(df.iloc[1:3,1:3])
print(type(df.iloc[1:3,1:3]))
print(df.iloc[3,3])
print(type(df.iloc[3,3]))

Very similar to ```iloc```, the ```loc``` attribute can be used to access (groups of) rows and columns by their labels. For example (note the difference in the interpretation of the range now that the index labels are used: with ```loc``` the end of the range is included, so the following returns the rows labeled 1, 2 and 3):

In [None]:
print(df.loc[1:3,"murders_2014"])
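
The different treatment of the range end can be checked on a tiny data frame. The table below is made up for this sketch; only the column names follow the lecture's data set:

```python
import pandas as pd

# Made-up miniature table with the default index 0..4
demo_df = pd.DataFrame({
    "municipality": ["Town A", "Town B", "Town C", "Town D", "Town E"],
    "murders_2014": [0, 2, 1, 0, 3],
})

by_position = demo_df.iloc[1:3]   # positions 1 and 2, end excluded
by_label = demo_df.loc[1:3]       # labels 1, 2 and 3, end included

print(len(by_position))
print(len(by_label))
```

With the default index the labels and positions coincide, so the only visible difference here is the extra row returned by `loc`.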

Without using any attributes, just with a pair of square brackets, columns in a data frame can be addressed by their name. For example, to access the “murders_2014” column of our example data frame, its name can be used as reference:

In [None]:
print(df["murders_2014"])
print(type(df["murders_2014"]))

Again, the output is a Series, so this is another way to get this data structure.

Accessing several columns at once is also possible. The result is a data frame:

In [None]:
print(df[["municipality","murders_2014"]])
print(type(df[["municipality","murders_2014"]]))

Another handy feature is to filter data frames based on certain criteria. For example, we might only want to see the data of municipalities with at least 3 murders:

In [None]:
print(df[df["murders_2014"]>=3])

Or the data for the province of Utrecht:

In [None]:
print(df[df["province"]=="Utrecht"])

Or for the municipalities in the province of Utrecht with at least one murder:

In [None]:
print(df[(df["murders_2014"]>=1) & (df["province"]=="Utrecht")])
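
It can help to look at what happens inside the square brackets. Each comparison produces a Series of Boolean values (often called a mask), and `&` combines two masks element by element. The sketch below uses a made-up miniature table with the lecture's column names:

```python
import pandas as pd

# Made-up miniature table (values are invented)
demo_df = pd.DataFrame({
    "municipality": ["Town A", "Town B", "Town C", "Town D"],
    "province": ["Utrecht", "Utrecht", "Drenthe", "Utrecht"],
    "murders_2014": [0, 2, 1, 3],
})

has_murders = demo_df["murders_2014"] >= 1      # one True/False per row
in_utrecht = demo_df["province"] == "Utrecht"   # one True/False per row
mask = has_murders & in_utrecht                 # True only where both hold

print(mask.tolist())

# only the rows where the mask is True are kept
selected = demo_df[mask]
print(selected["municipality"].tolist())
```

When the conditions are written inline, each of them needs its own pair of parentheses, because `&` binds more tightly than `>=` and `==`.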

Note that there are several other clever ways to access (ranges of) values in data frames, but discussing them all would be out of scope of this lecture. We will see some of them in the examples later on, but if you are interested in digging deeper into this, please refer to the official “Indexing and Selecting Data” guide at http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html or ask Google if you are looking for hints on how to index best in a specific situation.

In the following we will look at a few methods that pandas data frames provide. This selection is by no means complete, either, but you can find the full list at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html.

For example, there are methods to easily sum up values, or to get basic statistical information like the max, min, mean and median values. Just to show a few:

In [None]:
print(f"There were {df['murders_2014'].sum()} murders in total.")
print(f"The maximum number of murders in a municipality was "\
      f"{df['murders_2014'].max()}.")
print(f"The average number of murders per municipality was "\
      f"{df['murders_2014'].mean():.3f}.")
print(f"The average number of murders per municipality with at "\
      f"least one murder was "\
      f"{df[df['murders_2014']>=1]['murders_2014'].mean():.3f}.")

The ```hist``` method can be used to plot simple histograms from data:

In [None]:
print(df["murders_2014"].hist())

Or, with the number of bins equal to the maximum number of murders:

In [None]:
print(df["murders_2014"].hist(bins=df["murders_2014"].max()))

If a data frame contains several columns with numeric values, the ```hist``` method will create histograms for all of them. For example, when called on the whole data frame:

In [None]:
print(df.hist())

The possibilities for making histograms with ```hist()``` more “beautiful” are a bit limited, so other libraries should be used when a better design is wanted. However, for a quick check of the distribution of data in a data frame it is very suitable.

As a last example for today, we want to sort the data in the data frame according to the number of murders (descending), instead of having it sorted by municipality, as it is now. The ```sort_values()``` method is what we need:

In [None]:
sorted_df = df.sort_values("murders_2014", ascending=False)
print(sorted_df)

Note that the index column was sorted with the rest of the data, too. So, if we want to have indices there running up from 0, we need to reset the index:

In [None]:
sorted_reindexed_df = sorted_df.reset_index(drop=True)
print(sorted_reindexed_df)
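
The effect of sorting on the index, and of the `drop` parameter, can be followed on a small made-up table (the values are invented for this sketch):

```python
import pandas as pd

# Made-up miniature table (no ties, so the sort order is unambiguous)
demo_df = pd.DataFrame({
    "municipality": ["Town A", "Town B", "Town C"],
    "murders_2014": [0, 2, 1],
})

sorted_demo = demo_df.sort_values("murders_2014", ascending=False)
print(sorted_demo.index.tolist())   # the old labels travel with their rows

# drop=True discards the old labels and numbers the rows from 0 again
renumbered = sorted_demo.reset_index(drop=True)
print(renumbered.index.tolist())

# without drop=True the old labels are kept as an extra column "index"
kept = sorted_demo.reset_index()
print(kept.columns.tolist())
```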

Finally, note that data frames can easily be saved as CSV files with the ```to_csv()``` method. For example:

In [None]:
sorted_reindexed_df.to_csv('dutch_municipalities_sorted.csv')
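
By default, `to_csv()` separates values with commas and also writes the index as a first column. If the output should have the same tab-separated layout as the input file read above, the `sep` and `index` parameters can be set. The sketch below uses a made-up table and, to avoid creating files, lets `to_csv()` return the text as a string by not passing a file name:

```python
import io

import pandas as pd

# Made-up miniature table (values are invented)
demo_df = pd.DataFrame({
    "municipality": ["Town A", "Town B"],
    "murders_2014": [2, 0],
})

default_text = demo_df.to_csv()                   # commas, index included
tab_text = demo_df.to_csv(sep="\t", index=False)  # tabs, no index column

print(default_text.splitlines()[0])
print(tab_text.splitlines()[0])

# the tab-separated text can be read back in the same way as the input file
read_back = pd.read_csv(io.StringIO(tab_text), sep="\t")
print(read_back)
```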

We will see more about data frames in the following lecture(s).