# Module 8: File I/O and Error Handling
March 4, 2022

Last time we covered data structures (lists, tuples, dictionaries, sets) in Python that allow us to work with more powerful data items than just the individual numbers, strings and Booleans that we had used before. We also discussed the important difference between call by value and call by reference.

Until now the course dealt with the basics of imperative programming in Python, and you have learned about the most important concepts that you need as a programmer. We will now leave the relatively secluded, controlled environment that we have been in so far and look at how to read and write data from and to files, access online resources, and use external libraries. Connected to that, we will look at how to make programs more robust against errors that come from “the outside”.

Today we will cover how to read and write files in general, how to deal with CSV files in particular, and how to handle runtime errors that can for example be caused by user inputs or file operations.

Next time we will have a look at fetching data and other resources from the internet, and how to interact with web services from within Python programs. 

## Reading and Writing Files

Python distinguishes between only two types of files: **text and binary**. Basically, anything that is not a text file is regarded as binary. Text files are sequences of lines, which are themselves sequences of characters that are terminated with a special end-of-line (EOL) character, often the newline character. The content of text files can be processed with the common string manipulation functionality, while processing binary files requires knowledge about their structure. For the moment we are only concerned with text files.

To open a file, first a file object needs to be created with the ```open()``` function:

```
<file_object> = open(<filename>, <mode>)
```

```<filename>``` is the name (path) of the file to open, and ```<mode>``` specifies for which kind of processing the file is opened ("r" for reading content, "w" for writing content, "a" for appending content, or "r+" for a special read and write mode).  For example:

In [None]:
# create a file object in reading mode
file = open("data/shorttext.txt", "r")
print(file)
file.close()

When the file is opened, operations according to the chosen mode can be carried out. When all operations on the file have been performed, the file should be closed again to avoid unintended side effects:
```
file.close()
```
Play around with the following code examples and a small text file of your choice to see what happens. Add printouts to visualize what has been read by the different commands.

For example, when opened in reading mode we can call different functions for reading content from the file:

In [None]:
# creating a file object in reading mode
file = open("data/shorttext.txt", "r")

# file.read() to read all characters in the file
# content = file.read()
# print(content)

# file.read(n) to read the first/next n characters of the file
# first_n = file.read(10)
# print(first_n)

# file.readline() to read a (the first/next) line of the file
# first_line = file.readline()
# print(first_line)

# file.readlines() to read the content of the file as a list of lines
lines = file.readlines()
print(lines)

# close file
file.close()

When opened in writing mode, we can call different functions to write text into the file:

In [None]:
# creating a file object in writing mode
file = open("data/textdump.txt", "w")

# file.write to write (or append) text to a file
file.write("Hello World!\n")
file.write("It's cold today...\n")
file.writelines(["Another line\n", "and another line\n"]) 

# close file
file.close()

Change this example from writing to appending mode (parameter "a") and see what the difference is.
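
For reference, the difference between the two modes can be sketched as follows. The scratch file in a temporary directory is an assumption made for this illustration, so that nothing in the data folder is overwritten:

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (not part of the course data)
path = os.path.join(tempfile.mkdtemp(), "textdump.txt")

# "w" creates the file, or empties it first if it already exists
file = open(path, "w")
file.write("first run\n")
file.close()

file = open(path, "w")
file.write("second run\n")
file.close()

file = open(path, "r")
after_write = file.read()
file.close()

# "a" keeps the existing content and adds the new text at the end
file = open(path, "a")
file.write("third run\n")
file.close()

file = open(path, "r")
after_append = file.read()
file.close()

print(repr(after_write))   # 'second run\n'
print(repr(after_append))  # 'second run\nthird run\n'
```

Opening in "w" mode twice leaves only the text of the second run, while "a" mode adds to whatever is already there.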

With the ```with```-statement, Python provides an alternative, elegant way to handle files. It also takes care of closing the file, so it is a good idea to make it a habit to use it for file handling (and never forget closing):

In [None]:
with open("data/shorttext.txt", "r") as file:
    content = file.read()
    
with open("data/newtext.txt", "w") as file:
    file.write("Hello World!\n")
    file.write("It's cold today...\n")
    file.writelines(["Another line\n", "and another line\n"])

Note that there is also a short and elegant way to iterate over all lines of a file, without explicitly calling ```readlines()``` before:

```
for line in file:
    <do something with line>
```    
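
A small runnable sketch of this loop, assuming a made-up three-line file that is first written to a temporary directory (the file name and its content are not part of the course data):

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as file:
    file.writelines(["first line\n", "second line\n", "third line\n"])

line_lengths = []
with open(path, "r") as file:
    for line in file:
        # each line still carries its trailing "\n"; rstrip removes it
        text = line.rstrip("\n")
        line_lengths.append(len(text))
        print(text)

print(line_lengths)  # [10, 11, 10]
```

Iterating over the file object reads one line at a time, so the whole file does not have to fit into a list first.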
    
As a more complete example, see the following code to read the text from a file, encrypt it using the Caesar cipher, and write it into another file:

In [None]:
from caesarcipher import CaesarCipher

with open("data/shorttext.txt", "r") as file:
    content = file.read()
    
content_encrypted = CaesarCipher(content, offset=3)

with open("data/shorttext_encrypted.txt", "w") as file:
    file.write(content_encrypted.encoded)

This code produces no output on the command line, but if you try it with a text file yourself, you will see the effect in the new file that is created.

![](img/activity_small.png) Challenge!

Create a function that receives a filename and counts the number of times each word appears in the text file.

Hints:
1. Use a dictionary to store the word frequencies
2. Don't worry about punctuation
3. Use a loop to go over the words
4. Use the ```split()``` method to get the list of words from a string. E.g. 

In [None]:
text = "this is a sample text"
text.split()

In [None]:
def word_frequencies(filename):
    # dictionary that maps each word to the number of times it occurs
    frequencies = {}
    with open(filename, "r") as file:
        content = file.read()
        for word in content.split():
            if word in frequencies:
                frequencies[word] += 1
            else:
                frequencies[word] = 1
    return frequencies
    
word_frequencies("data/shorttext.txt")

## Dealing With CSV Files
Let's look at another kind of text file that you will frequently come across when working on data science problems: CSV files. CSV stands for "comma-separated values" and means that commas are used to separate the values in a line from each other. Sometimes other characters are used as separators instead, such as the tab character "\t" or the semicolon ";", so don't be confused if you see that. As such, CSV files are a simple means to represent tabular data. The following example is based on the Dutch municipalities data set from Kaggle (https://www.kaggle.com/justinboon/municipalities-of-the-netherlands/data), stored in the file dutch_municipalities.csv. We can open and read this file as in the examples above:

In [None]:
with open("data/dutch_municipalities.csv", "r") as csvfile:
    print(csvfile.read())

In this form (as one long string) the content of the CSV file is of course not of much use, as it is difficult to access individual elements from it. Instead of reading the file content completely, we could read it line by line (getting a list of lines), and then split the lines at the separator to create a list or dictionary of the elements in each row of the table, resulting in a big list of lists or list of dictionaries. Luckily, however, CSV files are so common that there is a standard-library module called csv that provides this and other frequently needed functionality for working with CSV files (please refer to the online documentation at https://docs.python.org/3/library/csv.html for full reference). Here are some examples of what working with the module can look like:

In [None]:
# import the csv library
import csv

# csv.reader yields the rows of the file one by one, each as a list of strings
with open("data/dutch_municipalities.csv", "r") as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')
    for row in csvreader:
        print(row[0])

In [None]:
# csv.DictReader yields each row as a dictionary, using the first row of the CSV file as keys
with open("data/dutch_municipalities.csv", "r") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter='\t')
    for row in csvreader:
        print(f'{row["municipality"]}: {row["university"]}')

In [None]:
# same as the previous example, but printing only municipalities with at least one university
with open("data/dutch_municipalities.csv", "r") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter='\t')
    for row in csvreader:
        if int(row["university"]) != 0:
            print(f'{row["municipality"]}: {row["university"]}')
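
For comparison, the manual approach mentioned earlier (splitting each line at the separator) could be sketched as follows. The sample text is made up for this illustration and only mimics the tab-separated layout; it is not taken from the real data set:

```python
# Made-up sample in a tab-separated layout (an assumption, not the real file content)
sample = "municipality\tprovince\tuniversity\nAstad\tUtrecht\t1\nBdorp\tZeeland\t0\n"

lines = sample.splitlines()
header = lines[0].split("\t")      # the first line holds the column names

rows = []
for line in lines[1:]:
    values = line.split("\t")
    # pair every column name with the value at the same position
    rows.append(dict(zip(header, values)))

for row in rows:
    print(row["municipality"], row["university"])
```

This simple splitting works for plain values, but it breaks as soon as a quoted field contains the separator itself. The csv module handles such cases, which is one reason to prefer it.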

![](img/activity_small.png)  Challenge!

Write code to print only the municipalities with an average household income above 40000.

In [None]:
with open("data/dutch_municipalities.csv", "r") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter='\t')
    for row in csvreader:
        if int(row["avg_household_income_2012"]) > 40000:
            print(f'{row["municipality"]}: {row["province"]}')

In [None]:
with open("data/dutch_municipalities.csv", "r") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter='\t')
    for row in csvreader:
        income = row["avg_household_income_2012"]
        if income != "" and int(income) > 40000:
            print(f'{row["municipality"]}: {row["province"]}')

### Pandas

If you want to do more advanced things with the data from CSV files, for example merge, join, or concatenate tables from different CSV files, you can absolutely do that with CSV files read in as above and the knowledge about loops, conditions, lists, dictionaries etc. that you have, but it can be a bit tricky. This is why, when such operations are (likely to be) needed, it is usually recommended to use the pandas library (http://pandas.pydata.org/), which has some specialized functions for this.

Pandas has its own function for reading CSV files, which returns the result as a so-called data frame, as shown in the following example:

In [None]:
import pandas as pd

df = pd.read_csv('data/dutch_municipalities.csv', sep="\t")
print(df)

Data frames are two-dimensional labeled data structures, very much like tables. The rows are labeled by an index (typically ascending from 0), and the columns are labeled by the column names, corresponding to the kind of data that is contained in them. See https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe for further details.



<figure>
  <img src="https://www.w3resource.com/w3r_images/pandas-data-frame.svg" alt="Pandas data frame" style="width:45%">
  <figcaption>Source: https://www.w3resource.com/</figcaption>
</figure>

The ```head()``` method returns the first ```n``` rows (default = 5) of a data frame. It is useful for quickly testing if your object has the right type of data in it.

In [None]:
df.head()

Data frames have a number of attributes, such as the column labels, the row indices and the types of the data in the columns (see a full list at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), that can be accessed as illustrated below:

In [None]:
print(df.index)
print("----------")
print(df.columns)
print("----------")
print(df.dtypes)

The ```info()``` method prints a concise summary of a DataFrame:

In [None]:
print(df.info())

Via the ```iloc``` attribute we can access a row by its index, for example:

In [None]:
print(df.iloc[39])
print("----------")
print(type(df.iloc[39]))

Apparently, a single row of a data frame is of type “Series” (see https://pandas.pydata.org/pandas-docs/stable/reference/series.html for full reference), which basically means a one-dimensional labeled data structure. Series are iterable. You have maybe already noticed that many functions in, e.g., pandas and matplotlib take Series as input, and this is one way to get them.

Slicing works with ```iloc```, too, so a range of indices can be used to access several rows at a time. The result is of type “DataFrame” again:

In [None]:
print(df.iloc[39:42])
print("----------")
print(type(df.iloc[39:42]))

Similarly, a list of indices (not necessarily a range) can be used:

In [None]:
print(df.iloc[[38,40,42]]) 
print("----------")
print(type(df.iloc[[38,40,42]]))

The ```iloc``` access can also be used for indexing at both axes of the data frame, including accessing a single element (note the different resulting data types):

In [None]:
print(df.iloc[1:3,1:3])
print("----------")
print(type(df.iloc[1:3,1:3]))
print("----------")
print(df.iloc[3,3])
print("----------")
print(type(df.iloc[3,3]))

Very similar to ```iloc```, the ```loc``` attribute can be used to access (groups of) rows and columns by their labels. For example (note the difference in the interpretation of the range now that the labels of the indexes are used):

In [None]:
print(df.loc[1:3,"population"])
print("----------")
print(type(df.loc[1:3,"population"]))
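
The different interpretation of the range can be made visible with a tiny data frame. The values below are invented for this sketch; only the ```iloc``` and ```loc``` behavior matters:

```python
import pandas as pd

# Made-up data frame with the default index 0..4
towns = pd.DataFrame({
    "municipality": ["A", "B", "C", "D", "E"],
    "population": [10, 20, 30, 40, 50],
})

by_position = towns.iloc[1:3]   # positions 1 and 2, end excluded
by_label = towns.loc[1:3]       # labels 1, 2 and 3, end included

print(len(by_position))                # 2
print(len(by_label))                   # 3
print(by_label["population"].tolist())  # [20, 30, 40]
```

With ```iloc``` the slice behaves like list slicing, whereas ```loc``` treats both ends of the range as labels and includes them.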

Without using any attributes, just in pairs of square brackets, columns in a data frame can be addressed by their name. For example, to access the “population” column of our example data frame, its name can be used as reference:

In [None]:
print(df["population"])
print("----------")
print(type(df["population"]))

Again, the output is a Series, so this is another way to get this data structure.

Accessing several columns at once is also possible; the result is a data frame:

In [None]:
print(df[["municipality","population"]])
print("----------")
print(type(df[["municipality","population"]]))

![](img/activity_small.png)  Challenge! (small)

What is the difference between ```df[39]``` and ```df.iloc[39]```?

In [None]:
df.iloc[39]
df[39]

Another handy feature is to filter data frames based on certain criteria. For example, we might only want to see the data of municipalities with at least 150,000 inhabitants:

In [None]:
print(df[df["population"]>=150000])

Or the data for the province of Utrecht:

In [None]:
print(df[df["province"]=="Utrecht"])

Or for the municipalities in the province of Utrecht with at least 150,000 inhabitants:

In [None]:
print(df[ (df["province"]=="Utrecht") & (df["population"]>=150000) ])
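
What happens inside the square brackets can be sketched with a small, invented data frame (the names and numbers are assumptions for illustration only). Each comparison produces a Series of Booleans, and ```&``` combines two of them element by element:

```python
import pandas as pd

# Made-up data, not taken from the municipalities file
towns = pd.DataFrame({
    "municipality": ["Astad", "Bdorp", "Cveen"],
    "province": ["Utrecht", "Utrecht", "Zeeland"],
    "population": [200000, 50000, 180000],
})

in_utrecht = towns["province"] == "Utrecht"   # Boolean Series
is_large = towns["population"] >= 150000      # Boolean Series
mask = in_utrecht & is_large                  # True only where both are True

print(mask.tolist())                          # [True, False, False]
selected = towns[mask]                        # keeps the rows where mask is True
print(selected["municipality"].tolist())      # ['Astad']
```

When the conditions are written inline, each one needs its own pair of parentheses, because ```&``` binds more tightly than ```==``` and ```>=```.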

Note that there are several other clever ways to access (ranges of) values in data frames, but discussing them all would be out of scope of this lecture. We will see some of them in the examples later on, but if you are interested in digging deeper into this, please refer to the official “Indexing and Selecting Data” guide at http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html or ask Google if you are looking for hints on how best to index in a specific situation.

In the following we will look at a few methods that pandas data frames provide. This selection is by no means complete, either, but you can find the full list at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html.

For example, there are methods to easily sum up values, or get basic statistical information like the max, min, mean and median values. Just to show a few:

In [None]:
print(f"Population was {df['population'].sum()} in total.")
print(f"The maximum population in a municipality was "\
      f"{df['population'].max()}.")
print(f"The average population per municipality was "\
      f"{df['population'].mean():.3f}.")
print(f"The average population per municipality with at "\
      f"least 1 university was "\
      f"{df[df['university']>=1]['population'].mean():.3f}.")

The ```hist``` method can be used to plot simple histograms from data:

In [None]:
print(df["avg_household_income_2012"].hist())

Or, with a larger number of bins:

In [None]:
print(df["avg_household_income_2012"].hist(bins=20))

If a data frame contains several columns with numeric values, the ```hist``` method will create histograms for all of them. For example, when called on the whole data frame:

In [None]:
print(df.hist())

The possibilities for making histograms with ```hist()``` more “beautiful” are a bit limited, so other libraries should be used when a better design is wanted. However, for a quick check of the distribution of data in a data frame it is very suitable.

As a last example for today, we want to sort the data in the data frame according to average household income (descending), instead of having them sorted by municipality, like it is now. The ```sort_values()``` method is what we need:

In [None]:
sorted_df = df.sort_values("avg_household_income_2012", ascending=False)
print(sorted_df[["municipality", "avg_household_income_2012"]])

Note that the index column was sorted with the rest of the data, too. So, if we want to have indices there running up from 0, we need to reset the index:

In [None]:
sorted_reindexed_df = sorted_df.reset_index(drop=True)
print(sorted_reindexed_df[["municipality", "avg_household_income_2012"]])

Finally, note that data frames can easily be saved as CSV files with the ```to_csv()``` method. For example:

In [None]:
sorted_reindexed_df.to_csv('data/dutch_municipalities_sorted.csv')
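
Two details of ```to_csv()``` are easy to overlook: by default it writes the index as an extra first column, and it uses a comma as separator. A minimal sketch with an invented two-row data frame (when no file name is given, the method returns the CSV text as a string instead of writing a file):

```python
import pandas as pd

# Made-up data frame, only for illustration
towns = pd.DataFrame({
    "municipality": ["Astad", "Bdorp"],
    "population": [1000, 2000],
})

with_index = towns.to_csv().splitlines()
without_index = towns.to_csv(index=False).splitlines()
tab_separated = towns.to_csv(index=False, sep="\t").splitlines()

print(with_index[0])      # ,municipality,population
print(without_index[0])   # municipality,population
print(without_index[1])   # Astad,1000
```

Passing ```index=False``` avoids the unnamed index column, and ```sep="\t"``` keeps the output tab-separated like the input file used in this lecture.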

We will see more about data frames in the following lecture(s).

## Error Handling
There are basically two kinds of errors that can be detected by the Python interpreter: syntax (aka parsing) errors and exceptions (aka runtime or execution-time errors). ```SyntaxErrors``` are caused by syntactically incorrect code (like invalid variable names, forgotten indentations, braces, quotation marks or colons, etc.; Spyder will often already point you to them). They are fixed by correcting the code accordingly. Syntactically correct code can however still cause exceptions during execution. For example, a division by zero will result in a ```ZeroDivisionError```, and a type mismatch between str and int will result in a ```TypeError```. We say that an exception is "thrown" (or "raised") at runtime when the respective error occurs, and we can add code to "catch" and handle it if that happens (and thus prevent the program from simply crashing). That is done by the try-and-except construct in Python. Simply put, it defines what should be tried, and what happens if that goes wrong:

```
try:
    <do something>
except <error>:
    <do something to react on error>
```

For example, a `ValueError` is thrown when the user's input is not convertible into an integer, so we can catch it and display an error message accordingly:

In [None]:
try:
    x = int(input("Please enter a number: "))
except ValueError:
    print("That was no valid number.")

In [None]:
int('hello')


In this case, it would in practice be handy if the user is asked to try again, until they enter a valid input. Maybe even encapsulated into a function, to have a specific, error-handling reader available for reuse:

In [None]:
def read_integer(prompt):
    while True:
        try:
            x = int(input(prompt))
            return x
        except ValueError:
            print("That was no valid number. Try again.")
            
# in main program:
number = read_integer("Please enter a number: ")

As another example: When handling files, it can easily happen that the path to the file to be opened is not correct, and the file cannot be opened. Then the ```FileNotFoundError``` can be caught to prevent the program from crashing because of that:

In [None]:
filename = input("Enter file name: ")
while True:
    try:
        with open(filename, "r") as file:
            print(file.read())
        break
    except FileNotFoundError:
        print("File not found. Please try again.")
        filename = input("Enter file name: ")

There are several built-in exceptions in Python. We cannot go through them all, but you can find them listed at https://docs.python.org/3/library/exceptions.html.

Often several things can potentially go wrong, so that it makes sense to catch several exceptions:

In [None]:
number1 = read_integer("Enter number 1: ")
number2 = read_integer("Enter number 2: ")
try:
    print(number1 * number2)
    print(number1 / number2)
except (FloatingPointError, OverflowError, ZeroDivisionError):
    print("Something went wrong with the calculation.")
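
As a side note, the three exception classes in the tuple above are related: in the built-in exception hierarchy they all derive from ```ArithmeticError```. The following sketch checks that and uses the common base class in a hypothetical helper function ```divide``` (the function is an assumption for illustration, not part of the lecture code):

```python
# All three classes from the tuple are kinds of ArithmeticError
for exc in (FloatingPointError, OverflowError, ZeroDivisionError):
    print(exc.__name__, issubclass(exc, ArithmeticError))

def divide(a, b):
    """Hypothetical helper: return a / b, or None if the calculation fails."""
    try:
        return a / b
    except ArithmeticError:
        # catches ZeroDivisionError, OverflowError and FloatingPointError alike
        return None

print(divide(6, 3))         # 2.0
print(divide(1, 0))         # None, division by zero
print(divide(10**400, 1))   # None, result too large for a float
```

Catching a base class is a compact alternative to listing several exceptions, at the price of being less explicit about which errors are expected.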

Or in a more specific variant, distinguishing between division by zero and all other kinds of errors:

In [None]:
number1 = read_integer("Enter number 1: ")
number2 = read_integer("Enter number 2: ")
try:
    print(number1 * number2)
    print(number1 / number2)
except ZeroDivisionError:
    print("Division by 0!")
except:
    print("Something went wrong with the calculation.")

As you can maybe guess from the previous example, an except clause with no specific error defined will catch all (remaining) errors that happen in the try clause. In such a case, it is often useful to assign a name to the exception that is caught, so that the error-handling code can check its type or get the underlying error message, to deal with the exception accordingly. For example:

In [None]:
number1 = read_integer("Enter number 1: ")
number2 = read_integer("Enter number 2: ")
try:
    print(number1 * number2)
    print(number1 / number2)
except Exception as err:
    print("Error handling for:", err)
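
To illustrate what can be done with the named exception, here is a sketch without user input. The function ```try_division``` is a hypothetical helper introduced only for this example; it reports the type and the message of whatever exception occurs:

```python
def try_division(a, b):
    """Hypothetical helper: return the result as text, or a description of the error."""
    try:
        return str(a / b)
    except Exception as err:
        # type(err).__name__ is the class name, str(err) the error message
        return f"{type(err).__name__}: {err}"

print(try_division(6, 3))     # 2.0
print(try_division(1, 0))     # starts with "ZeroDivisionError:"
print(try_division("6", 3))   # starts with "TypeError:"
```

The same information could also be used to branch on the error type, for example with ```isinstance(err, ZeroDivisionError)```.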

Finally, note that with the ```raise``` statement it is also possible to let your own code throw one of the predefined or also self-defined exceptions:

In [None]:
temperature = read_integer("Enter temperature: ")
try:
    if 0 < temperature < 100:
        print("Water is liquid.")
    else:
        raise Exception("incompatible temperature", temperature)
except Exception as err:
    print(err) 
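
A self-defined exception is simply a class that inherits from ```Exception```. The following sketch repeats the temperature check with such a class; the names ```TemperatureError``` and ```check_liquid``` are made up for this illustration:

```python
class TemperatureError(Exception):
    """Hypothetical self-defined exception for temperatures outside the liquid range."""

def check_liquid(temperature):
    if 0 < temperature < 100:
        return "Water is liquid."
    # raise our own exception type with a descriptive message
    raise TemperatureError(f"incompatible temperature: {temperature}")

for value in (20, 120):
    try:
        print(check_liquid(value))
    except TemperatureError as err:
        print("Caught:", err)
```

Using an own exception class allows the calling code to catch exactly this kind of problem, without also catching unrelated errors.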

In practice it takes a bit of experience to decide how and where to implement error-handling behavior in software. In the scope of the projects that you are working on in this course, it would not be feasible to surround each individual statement by try-and-except clauses. As a practical rule, error-handling should be implemented at places where things can easily go wrong, such as reading input from the user (even users with a lot of goodwill make typos), handling files (working with file systems is always prone to unexpected behavior) or accessing online resources and services (communication with them can be affected by network problems etc.). Generally, the less control the programmer (or their code) has over what happens, the more error-handling is a good idea.

![](img/activity_small.png)  Challenge!

Write code to print only the municipalities with an average household income above 40000.

Handle the case when the average household income is missing.

In [None]:
#with open("data/dutch_municipalities.csv", "r") as csvfile:
#    csvreader = csv.DictReader(csvfile, delimiter='\t')
#    for row in csvreader:
#       if int(row["avg_household_income_2012"]) > 40000:
#            print(f'{row["municipality"]}: {row["province"]}')
       
            
with open("data/dutch_municipalities.csv", "r") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter='\t')
    for row in csvreader:
        try: 
            if int(row["avg_household_income_2012"]) > 40000:
                print(f'{row["municipality"]}: {row["province"]}')
        except ValueError:
            print(f'No INCOME for --> {row["municipality"]}: {row["province"]}')

## Exercises

Please use Quarterfall to submit and check your answers. 

### 1. Interview Anonymization (★★★★☆)
Imagine you are a journalist, and you have written a text about an interview with somebody. Because the person wants to remain unrecognized, you have to replace their name with a fictitious one everywhere in the text before it gets published. Write a Python program that reads the file containing the interview text, replaces all occurrences of the original name with a new one (the `str.replace()` method can be used here), and saves the changed text in the file. You can use the text file "interview-with-a-syrian-refugee.txt" or create your own. Do not forget to implement error-handling.

### 2. Longest Word (★★★★☆)
Reuse your code from exercise 5.5 (Text Analysis) to create a function that finds the longest word in a text. Apply it to the text file that you used for exercise 1 above. The output should be something like: 
```
The longest word in the text is "responsibility".
```
Again, keep in mind to implement error-handling.

### 3. Randomized Story-Telling (★★★★☆)
One of the simple pen-and-paper games I remember from my childhood days goes as follows: A paper sheet is divided into four columns for the questions “Who?”, “Does what?”, “How?” and “Where?”. The first player would write down a person in the first column, then fold it away, the second would fill in a verb, fold it away, etc. After the fourth column has been filled, the complete sentence is read out. It could then be something like “My brother is showering excessively at the gas station.”

Write a program that creates a user-defined number of such random sentences. The file `"inputs.csv"` contains a list of possible answers to all of the four questions. Take the values from there. Feel free to add further words to the CSV file to create more variation. The output of the program should be something like:
```
How many sentences do you want to create? 3
My granny is dancing massively at the fair.
The butcher is travelling aggressively in bed.
My grandpa is reading nicely in the bathroom.
```

### 4. Population and Universities per Province (★★★★☆)
Write a Python program that reads in the CSV file `"dutch_municipalities.csv"` that we already used in the lecture. Sum up the population and universities for each province and write the result into a new CSV file `"dutch_provinces.csv"`, in alphabetical order of the province names. Its content should look like:
```
province,population,universities
Drenthe,488892.0,0
Flevoland,400179.0,0
Friesland,580537.0,0
Gelderland,1993851.0,2
Groningen,495508.0,1
Limburg,1119751.0,1
Noord-Brabant,2390214.0,2
Noord-Holland,2766854.0,2
Overijssel,1139754.0,1
Utrecht,1254034.0,1
Zeeland,380619.0,0
Zuid-Holland,3579503.0,3
```

### 5. Error Handling (★★★☆☆)
Add adequate "try and except" error handling to your code for exercises 1-4.
Include it in all code that you write from now on, at least when dealing with user inputs, file reading/writing operations, and accessing resources or services on the web.

## Extras for the Weekend
Exercise 3 was hopefully a bit of fun, but of course we generated a very simple kind of prose text there. The website https://eh.bard.edu/generating-algorithmic-poetry/ describes how to use Python to automatically generate poems in the style of Shakespeare or Dickinson. Have a look if you find that interesting!