# Python File Input/Output

-----

In this notebook, we build on the foundation provided by previous notebooks to introduce how to read and write data from and to a file. This is an important skill since we often want to share our results or will need to rerun an analysis, both of which are made considerably easier when the data can be reused.

First, you will learn how to open a file, write text data into it, and read that data back. Next, you will learn how to read and write delimiter-separated files (e.g., comma-separated value, or CSV, files) by using the Pandas `read_csv` and `to_csv` functions. Finally, you will learn to persist complex data like a DataFrame by using the `pickle` module.

## Table of Contents
[Working with Files](#Working-with-Files)  

[Pandas I/O](#Pandas-I/O)  

[Data Persistence Techniques](#Data-Persistence-Techniques)


-----
[[Back to TOC]](#Table-of-Contents)

## Working with Files

When working with files, or any other system resource, we must manage the underlying resource carefully. Here, that resource is a file and the associated file descriptor that the host operating system uses to reference it. Whenever we open a file, we want to be sure that the file is properly closed and that any data the program wrote has been flushed to permanent storage. To open a file, Python provides the built-in `open` function, which opens the named file and returns a file object that you either read from or write to, depending on the mode used to open the file. The file object, in turn, has a `close` method that closes the file and releases the resource.

To state explicitly why a file is being opened, the `open` function accepts a _mode_ argument, whose default value is `rt`, or _open for reading text data_. The allowed modes are detailed in the following table.

| Mode | Description                                                  |
| ---- | ------------------------------------------------------------ |
| 'r'  | reading (default)                                            |
| 'w'  | writing, truncate file first                                 |
| 'x'  | create a new file and open it for writing, fail if it exists |
| 'a'  | writing, append to end of file if it exists                  |
| 'b'  | binary mode                                                  |
| 't'  | text mode (default)                                          |
| '+'  | open for updating (reading and writing)                      |

Historically, you would read from or write to a text file only by using traditional Python file input and output. With the advent of powerful data science modules in Python, however, we now often read and write data directly from advanced data structures like a Pandas `DataFrame`. To open a text file named `test.txt` for writing without truncating the existing file contents (i.e., append), you would use `f = open('test.txt', 'a')`, and after all operations on the file are complete, you would use `f.close()` to close the file and release all associated resources.
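To make the modes in the table concrete, the following sketch opens and closes one file explicitly in several modes. The temporary directory and the file name `modes-demo.txt` are assumptions chosen only so that no existing file is touched.

```python
import os
import tempfile

# Work in a throwaway directory (illustrative assumption)
work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, 'modes-demo.txt')

# 'w' creates the file, or truncates it if it already exists
f = open(path, 'w')
f.write('first\n')
f.close()

# 'a' keeps the existing contents and writes at the end
f = open(path, 'a')
f.write('second\n')
f.close()

# 'x' refuses to overwrite: it raises FileExistsError if the file exists
try:
    f = open(path, 'x')
    f.close()
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

# 'r' is the default mode, so it can be omitted when reading
f = open(path)
contents = f.read()
f.close()

print(repr(contents))
print(exclusive_failed)
```

Every `open` call here is paired with a `close` call by hand, which is easy to forget when the code between them grows.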

In Python, file input/output typically employs a runtime [context](https://docs.python.org/3/reference/datamodel.html?highlight=context%20manager#with-statement-context-managers), which is a way to enforce what should happen when a code block is entered and exited. The _context_ is created by using the `with` statement, where the expression following the `with` keyword creates the actual context, which manages the entry into and exit from the enclosed code block. For our purposes, the standard application for a Python context is opening and closing files. As demonstrated in the following code block, we can open a file, perform operations on the file, and no longer worry about closing the file, which is taken care of automatically by the context.

```
# Example text to append; any string works here
data = 'Some text to append\n'
with open('temp.txt', 'a') as fout:
    fout.write(data)
```
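As a small check of this behavior, the sketch below raises an exception inside a `with` block and then inspects the file object's `closed` attribute. The file name `context-demo.txt` and the simulated error are assumptions made for illustration.

```python
import os
import tempfile

# Throwaway location for the demonstration (illustrative assumption)
path = os.path.join(tempfile.mkdtemp(), 'context-demo.txt')

try:
    with open(path, 'w') as fout:
        fout.write('partial line\n')
        # Simulate a failure in the middle of the block
        raise ValueError('simulated failure')
except ValueError:
    pass

# The context closed the file even though the block exited with an error
closed_after_error = fout.closed

# The text written before the error was flushed to the file on close
with open(path) as fin:
    saved = fin.read()

print(closed_after_error)
print(repr(saved))
```

Without the context, the same guarantee would require a `try`/`finally` block around every file operation.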

-----

### Writing Text Data

As previously described, to write text data (or data that can easily be converted to text) to a file, we open the file by using a context, which is created via the `with` statement in Python. We define a variable that refers to the newly opened file so that we write to the correct file. We write text to the file by using the `write` method on the file object that we just opened.

The following code snippet demonstrates this technique. First, we define a variable that holds the name of the file that will hold our text data. Next, we open the file as `fout`, after which we write two lines before exiting the context, which closes the file and ensures the text data are correctly written to the file. Since the file name contains no path information, the file is created in the current working directory.

------

In [1]:
# File writing demonstration

# file name under current folder
out_file = 'temp.txt'

# We explicitly place a newline at the end of each string
with open(out_file, 'w') as fout:
    fout.write("Hello World!\n")
    fout.write("Goodbye World!\n")

-----

To read data with Python, we simply open the file (in a context). For a text file, we can iterate through the file object, which returns each line of the text file as a Python string.

```
with open(out_file, 'r') as fin:
    for line in fin:
        print(line)
```

The following code opens the file we just created in the cell above, reads the file line by line, and prints each line.

In [2]:
with open('temp.txt', 'r') as fin:
    for line in fin:
        print(line)

Hello World!

Goodbye World!
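The blank lines in the output above appear because each string returned by the file iterator keeps its trailing newline, and `print` adds a second one. One way to avoid this, sketched below, is to strip the newline before printing. The file name `lines-demo.txt` is an assumption used so that the example stands alone.

```python
import os
import tempfile

# Recreate the two-line file in a throwaway location (illustrative assumption)
path = os.path.join(tempfile.mkdtemp(), 'lines-demo.txt')
with open(path, 'w') as fout:
    fout.write("Hello World!\n")
    fout.write("Goodbye World!\n")

# Strip the trailing newline from each line as it is read
with open(path, 'r') as fin:
    lines = [line.rstrip('\n') for line in fin]

for line in lines:
    print(line)
```

An alternative is to keep the lines unchanged and call `print(line, end='')`, which tells `print` not to add its own newline.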



<font color='red' size = '5'> Student Exercise </font>

In the empty **code** cell below, write a simple Python script to create a text file named `temp2.txt`. Write a few short sentences into the file. Then open the file `temp2.txt`, and read and display each line from the file.

-----

### Optional: Data Encoding

The `open` function also takes an `encoding` argument that specifies the character encoding used in the file. One of the earliest widely used character encodings was ASCII, which needs only 7 bits to represent each character (usually stored in an 8-bit byte). This encoding covers only the standard American typewriter characters and thus fails for most non-English languages or words. To support characters from any language, the [Unicode Consortium](http://www.unicode.org) was formed and the Unicode standard was subsequently developed. One of the most popular current character encodings is `utf-8`, which is an encoding of Unicode.
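The following sketch shows the `encoding` argument in use. The sample text and the file name `encoding-demo.txt` are assumptions chosen for illustration. The file is also reopened in binary mode to show that the non-ASCII characters occupy more than one byte each in `utf-8`.

```python
import os
import tempfile

# Throwaway file and sample text (illustrative assumptions)
path = os.path.join(tempfile.mkdtemp(), 'encoding-demo.txt')
text = 'café, naïve, 日本語\n'

# Write and read the text with an explicit encoding
with open(path, 'w', encoding='utf-8') as fout:
    fout.write(text)

with open(path, 'r', encoding='utf-8') as fin:
    round_trip = fin.read()

# Binary mode returns the raw bytes stored on disk
with open(path, 'rb') as fin:
    raw = fin.read()

print(round_trip == text)
# Number of characters versus number of bytes on disk
print(len(text), len(raw))
```

Stating the encoding explicitly on both the write and the read avoids depending on the platform's default encoding.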


-----
[[Back to TOC]](#Table-of-Contents)

## Pandas I/O

A previous lesson introduced the Pandas DataFrame, which is a powerful data structure that mimics a traditional spreadsheet. Data can easily be read from a CSV file, a fixed width format text file, a JSON file, an HTML file (i.e., a webpage), and a relational database into a Pandas DataFrame. 

In this notebook, this capability is demonstrated by first reading a CSV file with the Pandas `read_csv` function. After reading this file, we demonstrate how to write a DataFrame into another CSV file, followed by reading the newly created file into a DataFrame.


### Read from CSV file

When reading the airport data from this CSV file, we specify a comma as the delimiter. This is not necessary since a comma is the default delimiter; we set `delimiter` only to demonstrate how to use the argument. If the data are separated by another character, such as "|", you must set the argument explicitly. We also explicitly indicate that the index column is `iata`, which is the airport code. If you don't specify `index_col='iata'`, `iata` will be loaded as a regular column and the DataFrame will have a default index, which is the row number starting from 0. `read_csv` takes many arguments. To understand the function thoroughly, please check the function docstring with `help(pd.read_csv)` or `pd.read_csv?`.

In [3]:
import pandas as pd

# Read data from CSV file, and display subset
dfa = pd.read_csv('airport-data.csv', delimiter=',', index_col='iata')
# Show the first 5 rows
dfa.head()

| iata | airport              | city             | state | country | lat       | long        |
| ---- | -------------------- | ---------------- | ----- | ------- | --------- | ----------- |
| 00M  | Thigpen              | Bay Springs      | MS    | USA     | 31.953765 | -89.234505  |
| 00R  | Livingston Municipal | Livingston       | TX    | USA     | 30.685861 | -95.017928  |
| 00V  | Meadow Lake          | Colorado Springs | CO    | USA     | 38.945749 | -104.569893 |
| 01G  | Perry-Warsaw         | Perry            | NY    | USA     | 42.741347 | -78.052081  |
| 01J  | Hilliard Airpark     | Hilliard         | FL    | USA     | 30.688012 | -81.905944  |


In [4]:
# Show the last four rows
dfa.tail(4)

| iata | airport                   | city        | state | country | lat       | long        |
| ---- | ------------------------- | ----------- | ----- | ------- | --------- | ----------- |
| ZER  | Schuylkill Cty/Joe Zerbey | Pottsville  | PA    | USA     | 40.706449 | -76.373147  |
| ZPH  | Zephyrhills Municipal     | Zephyrhills | FL    | USA     | 28.228065 | -82.155916  |
| ZUN  | Black Rock                | Zuni        | NM    | USA     | 35.083227 | -108.791777 |
| ZZV  | Zanesville Municipal      | Zanesville  | OH    | USA     | 39.944458 | -81.892105  |
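To illustrate the case where the delimiter must be set explicitly, the sketch below reads pipe-separated text. The rows are made-up sample data (not taken from `airport-data.csv`), and `io.StringIO` stands in for a file on disk so that the example runs on its own.

```python
import io
import pandas as pd

# Made-up, pipe-separated sample rows (illustrative assumption)
pipe_text = (
    "code|name|elevation\n"
    "AAA|First Field|120\n"
    "BBB|Second Field|45\n"
)

# Set the delimiter explicitly and use the 'code' column as the index
df_pipe = pd.read_csv(io.StringIO(pipe_text), delimiter='|', index_col='code')
print(df_pipe)

# With the default comma delimiter, each line is read as a single field
df_wrong = pd.read_csv(io.StringIO(pipe_text))
print(df_wrong.shape)
```

A DataFrame with a single, oddly named column is usually the first sign that the delimiter does not match the file.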


### File Input/Output

We now demonstrate how to write and read these data by using Pandas' concise file I/O functionality. First, we define the name of the local CSV file. Second, we write the DataFrame in the CSV format. Finally, we read the CSV file into a new DataFrame and display the first few rows for comparison with the original DataFrame. Notice that this time when we call `pd.read_csv`, we don't set `delimiter` since the data are separated by commas. We also don't set `index_col`, so the resulting DataFrame has the default index and `iata` is loaded as a regular column.

In [5]:
# Define file names with type indicated by suffix
csv_file = 'airport-data2.csv'

In [6]:
# Write CSV file
dfa.to_csv(csv_file)

In [7]:
# Now read in the CSV file and display first five rows for comparison
dfa2 = pd.read_csv(csv_file)

dfa2.head(5)

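|     | iata | airport              | city             | state | country | lat       | long        |
| --- | ---- | -------------------- | ---------------- | ----- | ------- | --------- | ----------- |
| 0   | 00M  | Thigpen              | Bay Springs      | MS    | USA     | 31.953765 | -89.234505  |
| 1   | 00R  | Livingston Municipal | Livingston       | TX    | USA     | 30.685861 | -95.017928  |
| 2   | 00V  | Meadow Lake          | Colorado Springs | CO    | USA     | 38.945749 | -104.569893 |
| 3   | 01G  | Perry-Warsaw         | Perry            | NY    | USA     | 42.741347 | -78.052081  |
| 4   | 01J  | Hilliard Airpark     | Hilliard         | FL    | USA     | 30.688012 | -81.905944  |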
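If the index should survive the round trip, one option is to pass `index_col` when reading. Another is to leave the index out of the file with `index=False` when writing. The sketch below shows both on a small made-up DataFrame. The data and names are assumptions, and an in-memory buffer replaces the file on disk.

```python
import io
import pandas as pd

# Small made-up DataFrame with a named index (illustrative assumption)
df_small = pd.DataFrame(
    {'city': ['Springfield', 'Shelbyville'], 'lat': [1.5, 2.5]},
    index=pd.Index(['AAA', 'BBB'], name='code'),
)

# Write to an in-memory buffer; the index is written as the first column
buffer = io.StringIO()
df_small.to_csv(buffer)
csv_text = buffer.getvalue()

# Option 1: tell read_csv which column holds the index
restored = pd.read_csv(io.StringIO(csv_text), index_col='code')
print(restored)

# Option 2: omit the index when writing (to_csv returns a string
# when no file name or buffer is given)
no_index_text = df_small.to_csv(index=False)
print(no_index_text)
```

Which option fits depends on whether the index carries information, as the airport code does, or is only a row counter.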


-----
[[Back to TOC]](#Table-of-Contents)


## Data Persistence Techniques

We have already discussed the simplest persistence technique, basic file input/output, in this notebook. By using the Python programming language, you can open a file for reading and writing and even use binary mode to save storage space (or even directly use a compression technique by using the appropriate Python library, such as `bz2`). Below, we introduce one of the simplest approaches to data persistence available in Python, **pickling**.

To demonstrate how to pickle data, however, we first need a complex data structure. In the following code cell, we create a DataFrame, which we will use to demonstrate how to pickle data to and from a file.

-----

In [8]:
import pandas as pd

dt = {'A' : [0, 1, 2, 3, None], 
      'B' : [5, 6, None, 8, 9], 
      'C' : [10, None, 12, 13, 14], 
      'D' : [15, 16, 17, 18, 19],
      'E' : [20, 21, None, None, 24]}

df = pd.DataFrame(dt)

df

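|     | A   | B   | C    | D   | E    |
| --- | --- | --- | ---- | --- | ---- |
| 0   | 0.0 | 5.0 | 10.0 | 15  | 20.0 |
| 1   | 1.0 | 6.0 | NaN  | 16  | 21.0 |
| 2   | 2.0 | NaN | 12.0 | 17  | NaN  |
| 3   | 3.0 | 8.0 | 13.0 | 18  | NaN  |
| 4   | NaN | 9.0 | 14.0 | 19  | 24.0 |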


### Pickling

Python provides a simple technique, called pickling, that we can use to easily save data to a file and to later reconstitute the data into a Python program.

Pickling writes the _class_ information for any data being written to the file along with the data. When you _unpickle_ data, this class information is used to properly reconstitute the data in the pickled file. Pickling is easy to use and can often suffice for simple data persistence tasks. To pickle data to a file, you must import the `pickle` module and open a file in binary writing mode `'wb'`. After this, simply call the `pickle.dump()` function with the data to write and the file object. The file created in this process is a binary file, which means its content is not human-readable.

-----

In [9]:
import pickle

p_file = 'test.p'

with open(p_file, 'wb') as fout:
    pickle.dump(df, fout)

-----

Unpickling data is also easy; simply open the appropriate file in binary read mode `'rb'` and call the `pickle.load()` function to retrieve the data from the file and assign the result to a variable.

-----

In [10]:
with open(p_file, 'rb') as fin:
    new_df = pickle.load(fin)

new_df

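|     | A   | B   | C    | D   | E    |
| --- | --- | --- | ---- | --- | ---- |
| 0   | 0.0 | 5.0 | 10.0 | 15  | 20.0 |
| 1   | 1.0 | 6.0 | NaN  | 16  | 21.0 |
| 2   | 2.0 | NaN | 12.0 | 17  | NaN  |
| 3   | 3.0 | 8.0 | 13.0 | 18  | NaN  |
| 4   | NaN | 9.0 | 14.0 | 19  | 24.0 |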
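To see what it means for pickling to preserve class information, the sketch below pickles a small nested structure in memory with `pickle.dumps()` and `pickle.loads()`, the in-memory counterparts of `dump()` and `load()`. The `record` contents are made up for illustration.

```python
import pickle

# Made-up nested data mixing several built-in types (illustrative assumption)
record = {'id': 7, 'ratio': 0.25, 'tags': ('a', 'b'), 'seen': {1, 2, 3}}

# dumps returns the pickled bytes instead of writing them to a file
payload = pickle.dumps(record)
restored = pickle.loads(payload)

print(type(payload))
print(restored == record)
# The tuple and the set come back as a tuple and a set
print(type(restored['tags']), type(restored['seen']))

# Converting to text, by contrast, leaves only a string
as_text = str(record)
print(type(as_text))
```

With plain text output, the program reading the file must know how to parse each value back into the right type. With pickling, that information travels with the data.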


A Pandas DataFrame has a `to_pickle()` method, which pickles the DataFrame to a file, and Pandas provides the `read_pickle()` function to read a DataFrame from a pickle file. Both are demonstrated in the cell below.

In [11]:
df.to_pickle('test2.p')
new_df = pd.read_pickle('test2.p')
new_df

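|     | A   | B   | C    | D   | E    |
| --- | --- | --- | ---- | --- | ---- |
| 0   | 0.0 | 5.0 | 10.0 | 15  | 20.0 |
| 1   | 1.0 | 6.0 | NaN  | 16  | 21.0 |
| 2   | 2.0 | NaN | 12.0 | 17  | NaN  |
| 3   | 3.0 | 8.0 | 13.0 | 18  | NaN  |
| 4   | NaN | 9.0 | 14.0 | 19  | 24.0 |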


-----

While easier than custom read/write routines, pickling still requires the file system to provide support for concurrency, consistency, and durability. To go any further with data persistence, we will need to work with database systems, which is the topic of a future module. 

---

<font color='red' size = '5'> Student Exercise </font>

In this notebook, you have seen how to read data from and write data to files by using Python and how to pickle and unpickle data. In the following code cell, try to complete the following tasks.

1. Try writing and reading integer and floating point data to/from a file.
2. Try to pickle integer and floating point data to a file.
3. Try to pickle a Python list, a Python dictionary, and a Pandas DataFrame to a file, and also unpickle them into new variables.

-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. Python3 Tutorial section on [Input/Output](https://docs.python.org/3/tutorial/inputoutput.html)
2. The discussion on [Files](http://getpython3.com/diveintopython3/files.html) from the book _Dive Into Python 3_ provides a detailed view of persisting data
3. The book _Think Python_ includes a discussion on [reading and writing data](http://greenteapress.com/thinkpython2/html/thinkpython2015.html)
4. The book _A Byte of Python_ has a section on [Input/Output](https://python.swaroopch.com/io.html)

-----

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode