# Reading and writing files

In this section, we'll cover:
1. How to read and write a file with the standard library: `open()`, `read()` and `write()`.
2. How to read and write a file in csv/tsv format with the built-in `csv` module.
3. How to use `pandas` to process a csv/tsv file: `pd.read_csv()` and `pd.DataFrame.to_csv()`.

## Operating on files with the standard library
`open()`, `read()` and `write()`

In Python, a **file handler** (the Python documentation calls it a file object) is the connection between a program and a file on disk.
The following examples use the built-in `open()` function to open a file and assign the resulting file handler to the variable `f`.

By default, the file is opened in **read-only** mode. Still, it is good practice to always state the access mode explicitly in the second argument.

### Read

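The read modes ("r", "r+") open an existing file and raise a `FileNotFoundError` if the file does not exist.
- Read Only ("r")
- Read and Write ("r+"): writing starts at the pointer position (initially the beginning of the file), so new content overwrites the existing data character by character instead of being inserted.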
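
The behaviour for a missing file can be checked with a small sketch. The file name `no_such_file.txt` is hypothetical, and a temporary directory is used only to make sure the path really does not exist.

```python
import os
import tempfile

errors = {}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical path inside an empty temporary directory.
    missing_path = os.path.join(tmp_dir, "no_such_file.txt")

    for mode in ("r", "r+"):
        try:
            open(missing_path, mode)
        except FileNotFoundError as err:
            # Neither "r" nor "r+" creates the file for you.
            errors[mode] = type(err).__name__

print(errors)
```

Both modes end up in `errors`, which is the main difference from the write and append modes described later.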

In [94]:
f = open("fruits.txt", "r")

Once the file handler is created, we can call `f.read()` to read from the file.

Initially, the pointer is positioned at the beginning of the file.

In [95]:
print(f.read())

apple
banana
coconut



After `f.read()` runs, the pointer is positioned at the end of the file.
If you run `f.read()` again, it returns an empty string because there is nothing left to read.

In [47]:
f.read()

''
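
To make the idea of a pointer more concrete, here is a minimal sketch that uses `f.tell()` to report the current position. It writes its own copy of `fruits.txt` into a temporary directory, which is an assumption made only to keep the example self-contained. In text mode the number returned by `tell()` should be treated as an opaque position, so the sketch only compares it with the starting value.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fruits.txt")
    with open(path, "w") as f:
        f.write("apple\nbanana\ncoconut\n")

    f = open(path, "r")
    start = f.tell()         # position right after opening
    first_read = f.read()    # reads everything and moves the pointer
    end = f.tell()           # position after reading
    second_read = f.read()   # nothing left to read
    f.close()

print(start, end > start, repr(second_read))
```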

If you want to move the pointer back, use `f.seek(0)`, where 0 refers to the beginning of the file (just like an index).

In [48]:
f.seek(0)

0

Let's read the content again. Note that without `print()`, the notebook shows the raw string instead of rendering the line breaks. The content is actually a single string containing `\n`, the newline character.

In [49]:
f.read()

'apple\nbanana\ncoconut\n'

After you are done with the file, it is good practice to call `f.close()`. Closing the file releases it and makes sure any buffered writes reach the disk.

In [50]:
f.close()

If you try to operate on a file that is already closed, an error occurs.

In [51]:
f.read()

ValueError: I/O operation on closed file.

There is an alternative way to manage the file handler: `with open() as f`.

It does the same thing as `f = open()`, but the file handler `f` is closed automatically as soon as the `with` block ends, even if an error occurs inside the block.
This form is recommended unless you need to keep the file handler open across many different parts of your script.

In [52]:
with open("fruits.txt", "r") as f:
    print(f.read())

apple
banana
coconut



In [53]:
f.read()

ValueError: I/O operation on closed file.
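
The following sketch checks the `f.closed` attribute inside and outside a `with` block, and shows a roughly equivalent `try`/`finally` version. The temporary file is an assumption used only so that the example runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fruits.txt")
    with open(path, "w") as f:
        f.write("apple\n")

    with open(path, "r") as f:
        inside = f.closed    # still open inside the block
    outside = f.closed       # closed once the block has ended

    # Roughly what `with` does for you: close the file even if reading fails.
    g = open(path, "r")
    try:
        content = g.read()
    finally:
        g.close()

print(inside, outside, g.closed)
```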

We will demonstrate "r+" below, after creating a new file.

### Write

The write modes ("w", "w+") erase the existing file completely, or create a new file if it does not exist.
- Write Only ("w")
- Write and Read ("w+"): you can move the pointer and read back the content you just wrote.

In [64]:
with open("animals.txt", "w") as f:
    f.write("elephant")

In [65]:
with open("animals.txt", "r") as f:
    print(f.read())

elephant


Now let's simply open the file in write mode without writing anything.

In [66]:
with open("animals.txt", "w") as f:
    pass

All the contents are gone!

In [67]:
with open("animals.txt", "r") as f:
    print(f.read())




If you open the file in "w+" mode, you can read the content back after moving the pointer.

In [88]:
with open("animals.txt", "w+") as f:
    f.write("elephant\n")
    f.write("zebra\n")
    f.seek(0)
    print(f.read())

elephant
zebra



#### Problem - difference between "r+" and "w+"

We just demonstrated that both "w" and "w+" erase the existing data completely.
Can you use "r+" to open the file "animals.txt" and write "lion" to it?

You may also move the pointer to the beginning and read the file again to see what happened.

In [89]:
# Please write and test your code in this cell


In [90]:
with open("animals.txt", "r+") as f:
    f.write("lion")
    f.seek(0)
    print(f.read())

lionhant
zebra
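
The output shows that "r+" keeps the existing data and starts writing at the beginning, so "lion" replaces the first four characters of "elephant". The sketch below reproduces this in a temporary copy of the file (an assumption made for self-containment) and also shows one way to add text at the end in "r+" mode, by first moving the pointer with `f.seek(0, os.SEEK_END)`. The extra line "tiger" is a hypothetical value.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "animals.txt")
    with open(path, "w") as f:
        f.write("elephant\nzebra\n")

    with open(path, "r+") as f:
        f.write("lion")            # overwrites "elep", nothing is erased
        f.seek(0)
        overwritten = f.read()

    with open(path, "r+") as f:
        f.seek(0, os.SEEK_END)     # jump to the end before writing
        f.write("tiger\n")         # hypothetical extra line
        f.seek(0)
        appended = f.read()

print(repr(overwritten))
print(repr(appended))
```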



### Append

The append modes ("a", "a+") create a new file if it does not exist. When opening an existing file, they keep its content and place the pointer at the end of the file.

- Append Only ("a")
- Append and Read ("a+"): just like "w+", it lets you read back the content after moving the pointer.

In [91]:
with open("animals.txt", "a+") as f:
    f.write("donkey\n")
    f.seek(0)
    print(f.read())

lionhant
zebra
donkey
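
One detail worth knowing is that moving the pointer in append mode affects reading but usually not writing. The sketch below assumes a platform where append mode always writes at the end of the file regardless of the pointer position (the Python documentation describes this for some Unix systems), and uses a temporary file so it can run on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "animals.txt")
    with open(path, "w") as f:
        f.write("zebra\n")

    with open(path, "a+") as f:
        f.seek(0)              # move the pointer to the beginning...
        f.write("donkey\n")    # ...but the write still lands at the end
        f.seek(0)
        content = f.read()

print(repr(content))
```

If you need to overwrite data at a specific position, "r+" is the mode to reach for instead.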



In my opinion, these three modes cover about 90% of your needs:
- Use "r" if you only need to read.
- Use "w" if you need to write to a new file (or replace an existing one).
- Use "a" if you want to append to an existing file.

## Work on tab-delimited or comma-delimited files (tsv, csv)

tsv and csv are very common formats for storing data. They can be loaded into a dataframe or table, which is easy to modify.
Python has a third-party library called `pandas`, which provides many functions for dataframe manipulation.
There is also another programming language, R, in which dataframes are a core data structure.

Here we'll open a tsv/csv file in three ways: with plain file handlers, with the built-in `csv` module, and with pandas.

### Use `f.read()` to open a tsv file

You can definitely open a tsv file with the `f.read()` method we just learned.

In [156]:
with open("age.tsv", "r") as f:
    print(f.read())

Name	Gender	Age
Alice	F	20
Bob	M	40
Charlie	M	32



If we read it without `print()`, we can see that the content is actually one long string containing \t and \n.

The "\t" characters in the string are tabs and the "\n" characters are newlines.

In [159]:
f = open("age.tsv", "r")
f.read()

'Name\tGender\tAge\nAlice\tF\t20\nBob\tM\t40\nCharlie\tM\t32\n'

Let's use the string method `split()` to break the string into a list.

In [160]:
f.seek(0)
f.read().split("\n")

['Name\tGender\tAge', 'Alice\tF\t20', 'Bob\tM\t40', 'Charlie\tM\t32', '']

We successfully split the string into 4 items and stored them in a list. However, you might notice an extra empty item at the end. Looking at the previous output, you'll see that the empty item is caused by the newline character at the very end of the file.

We can modify our code slightly: first use `strip()` to remove the trailing "\n", then use `split()` to split the string into items.

In [165]:
f.seek(0)
stripped_string = f.read().strip("\n")
stripped_string

'Name\tGender\tAge\nAlice\tF\t20\nBob\tM\t40\nCharlie\tM\t32'

In [166]:
stripped_string.split("\n")

['Name\tGender\tAge', 'Alice\tF\t20', 'Bob\tM\t40', 'Charlie\tM\t32']

The previous code can be simplified into a one-liner.

In [167]:
f.seek(0)
f.read().strip("\n").split("\n")

['Name\tGender\tAge', 'Alice\tF\t20', 'Bob\tM\t40', 'Charlie\tM\t32']

Next, we still need to deal with the tab character "\t".

We can use a `for` loop to iterate through these 4 items and split each one again by "\t". For example, the second item becomes ["Alice", "F", "20"].

We can also create an empty list beforehand and store the list from each iteration with `append()`.

In [169]:
f.seek(0)

age_list = []
for data_string in f.read().strip("\n").split("\n"):
    data_list = data_string.split("\t")
    age_list.append(data_list)

In [170]:
age_list

[['Name', 'Gender', 'Age'],
 ['Alice', 'F', '20'],
 ['Bob', 'M', '40'],
 ['Charlie', 'M', '32']]

After that you can manipulate the list as you wish!

In [172]:
age_list[3]

['Charlie', 'M', '32']
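
The same parsing steps can be wrapped in a small helper. This is only a sketch: the function name `parse_tsv` is hypothetical, and it swaps `strip("\n").split("\n")` for the string method `splitlines()`, which does not produce an empty item for the trailing newline.

```python
def parse_tsv(text):
    """Split tab-delimited text into a list of rows (each row is a list of strings)."""
    return [line.split("\t") for line in text.splitlines()]


# Same content as the file read above, written out as a string.
sample = "Name\tGender\tAge\nAlice\tF\t20\nBob\tM\t40\nCharlie\tM\t32\n"
rows = parse_tsv(sample)

print(rows[3])
```

Note that every value, including the age, is still a string at this point.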

#### Problem - get rid of the header

If you don't want the header ["Name", "Gender", "Age"] to be stored in `age_list`, can you modify the `for` loop slightly to skip it?

Hint: You can combine `enumerate` and `continue`.

In [174]:
# Please write and test your code after f.seek(0) so that the data is read from the beginning of the file.
f.seek(0)
age_list = []


In [175]:
# Check the results
age_list

[]

#### Problem - Load data into dictionary

We learned how to save data into a dictionary with `d[new_key] = new_value`. Can you modify the `for` loop to save the data to a dictionary instead of a list?

In [176]:
# Please write and test your code after f.seek(0) so that the data is read from the beginning of the file.
f.seek(0)
age_dict = {}



In [177]:
# Check the results
age_dict

{}

In [179]:
# Lastly, remember to close the file!
f.close()

### Read with `csv` module

Besides plain file handlers, Python's standard library also provides the `csv` module for processing csv/tsv files.

The `csv` module is not loaded by default, so we have to load it with an `import` statement.

In [180]:
import csv

Now we can start using the `csv` module.

Loading data with the `csv` module is quite similar to what we did above. You still need to call `open()` to create a file handler first.

Next, we use `csv.reader()` to read the data. We also reuse the `enumerate()` approach to skip the header.

In [183]:
with open("age.tsv", "r") as f:
    reader = csv.reader(f)
    for idx, line in enumerate(reader):
        if idx != 0:
            print(line)

['Alice\tF\t20']
['Bob\tM\t40']
['Charlie\tM\t32']


The format is weird!

That is because `csv.reader` uses the comma (",") as the delimiter by default, so each tab-separated line is treated as a single field. To split on tabs ("\t") instead, specify `delimiter="\t"`.

In [184]:
with open("age.tsv", "r") as f:
    reader = csv.reader(f, delimiter="\t")
    for idx, line in enumerate(reader):
        if idx != 0:
            print(line)

['Alice', 'F', '20']
['Bob', 'M', '40']
['Charlie', 'M', '32']
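
Another common way to skip the header is to pull the first row out of the reader with `next()` before looping. The sketch below reads from an in-memory `io.StringIO` object instead of "age.tsv" (an assumption made so that it runs on its own). When opening a real file for the `csv` module, the Python documentation recommends passing `newline=""` to `open()`.

```python
import csv
import io

# In-memory stand-in for the first lines of the tsv file.
sample = io.StringIO("Name\tGender\tAge\nAlice\tF\t20\nBob\tM\t40\n")

reader = csv.reader(sample, delimiter="\t")
header = next(reader)    # consumes the first row
rows = list(reader)      # the remaining rows

print(header)
print(rows)
```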


The `csv` module also provides `csv.writer()`, which returns a writer object that helps with writing to a file.

In the example below, we pass the file handler to `csv.writer()` and use the writer's `writerow()` method to write a row to the file.

In [189]:
with open("age.tsv", "a+") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["Dave", "M", "31"])

    f.seek(0)
    print(f.read())

Name	Gender	Age
Alice	F	20
Bob	M	40
Charlie	M	32
Dave	M	31
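
To see exactly what `writerow()` produces, the sketch below writes to an in-memory `io.StringIO` buffer instead of a file (an assumption made for illustration). It shows that non-string values are converted to strings, that a field containing the delimiter is quoted, and that the writer ends each row with "\r\n" by default. The second row is a hypothetical record.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")

writer.writerow(["Dave", "M", 31])            # the integer is written as "31"
writer.writerow(["Ann\tMarie", "F", 27])      # hypothetical name containing a tab

written = buffer.getvalue()
print(repr(written))
```

The "\r\n" row ending is the reason the `csv` documentation suggests opening real files with `newline=""`, so that no extra newline translation is applied on top of it.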



#### Problem - write more data

Here we have a list containing 3 additional records. Can you figure out a way to write them to the file "age.tsv"?

In [192]:
# Please write and test your code in this cell

additional_data = [["Eve", "F", 25], ["Frank", "M", 20], ["Grace", "F", 40]]



In [194]:
# Use the following code to check the result.
with open("age.tsv", "r") as f:
    print(f.read())

Name	Gender	Age
Alice	F	20
Bob	M	40
Charlie	M	32
Dave	M	31
Eve	F	25
Frank	M	20
Grace	F	40



### Read with `pandas`

`pandas` is a third-party Python library that provides fast and flexible tools for data processing. In this section, I'll show you how to read a csv/tsv file with pandas.

First of all, we need to load the library with `import`, since `pandas` is not loaded by default (and, as a third-party package, it must be installed first).
Here I import pandas **as pd** so that later I can call any pandas function as `pd.xx_function` instead of `pandas.xx_function`.

In [196]:
import pandas as pd

You can simply use `pd.read_csv()` to read a csv/tsv file as a dataframe.
Note that the default separator in `pd.read_csv()` is the comma (","), so we need to specify `sep="\t"` to split on tabs.

Moreover, pandas automatically uses the first row as the header.

In [200]:
df = pd.read_csv("age.tsv", sep="\t")

In [201]:
df

      Name Gender  Age
0    Alice      F   20
1      Bob      M   40
2  Charlie      M   32
3     Dave      M   31
4      Eve      F   25
5    Frank      M   20
6    Grace      F   40
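
Unlike the `csv` module, `pd.read_csv()` also tries to infer a type for each column. The sketch below assumes `pandas` is installed and reads from an in-memory `io.StringIO` object rather than "age.tsv", so the name `df_sample` and its two rows are only illustrative.

```python
import io

import pandas as pd

# In-memory stand-in for the first lines of the tsv file.
sample = io.StringIO("Name\tGender\tAge\nAlice\tF\t20\nBob\tM\t40\n")
df_sample = pd.read_csv(sample, sep="\t")

columns = list(df_sample.columns)       # taken from the first row
age_kind = df_sample["Age"].dtype.kind  # "i" means an integer column
total_age = int(df_sample["Age"].sum())

print(columns, age_kind, total_age)
```

Because the ages are read as integers rather than strings, numeric operations such as `sum()` work directly.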


Here is a quick demonstration of how powerful `pandas` is. For example, I can add a column to the dataframe.

In [213]:
df['State'] = ['Florida', 'California', 'Kentucky', 'Colorado', 'New York', 'California', 'Texas']

In [214]:
df

      Name Gender  Age       State
0    Alice      F   20     Florida
1      Bob      M   40  California
2  Charlie      M   32    Kentucky
3     Dave      M   31    Colorado
4      Eve      F   25    New York
5    Frank      M   20  California
6    Grace      F   40       Texas


I can sort the dataframe by a column.

In [216]:
df.sort_values("Age")

      Name Gender  Age       State
0    Alice      F   20     Florida
5    Frank      M   20  California
4      Eve      F   25    New York
3     Dave      M   31    Colorado
2  Charlie      M   32    Kentucky
1      Bob      M   40  California
6    Grace      F   40       Texas


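If I want to know the average age of each gender, I can simply use the following code. Passing `numeric_only=True` restricts the calculation to numeric columns; without it, recent pandas versions raise an error when they reach text columns such as "Name" and "State".

In [217]:
df.groupby("Gender").mean(numeric_only=True)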

              Age
Gender
F       28.333333
M       30.750000


Since we've added a "State" column to the dataframe, let's write it back to the file.

In pandas, you can simply use `df.to_csv()` to write to a file. Passing `index=False` keeps the row index (0, 1, 2, ...) out of the file.

In [218]:
df

      Name Gender  Age       State
0    Alice      F   20     Florida
1      Bob      M   40  California
2  Charlie      M   32    Kentucky
3     Dave      M   31    Colorado
4      Eve      F   25    New York
5    Frank      M   20  California
6    Grace      F   40       Texas


In [221]:
df.to_csv("age.tsv", sep="\t", index=False)

Let's check it again in the more traditional way.

In [222]:
with open("age.tsv", "r") as f:
    print(f.read())

Name	Gender	Age	State
Alice	F	20	Florida
Bob	M	40	California
Charlie	M	32	Kentucky
Dave	M	31	Colorado
Eve	F	25	New York
Frank	M	20	California
Grace	F	40	Texas
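
To see why `index=False` matters, here is a minimal sketch that assumes `pandas` is installed. It uses a small hypothetical dataframe, `df_demo`, and relies on `to_csv()` returning the text as a string when no file path is given.

```python
import io

import pandas as pd

# Small hypothetical dataframe, used only for this illustration.
df_demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [20, 40]})

with_index = df_demo.to_csv(sep="\t")                  # row index is written too
without_index = df_demo.to_csv(sep="\t", index=False)  # data columns only

# Reading the first version back gives the index column a placeholder name.
reloaded = pd.read_csv(io.StringIO(with_index), sep="\t")
cols_with_index = list(reloaded.columns)
header_without_index = without_index.splitlines()[0]

print(cols_with_index)
print(repr(header_without_index))
```

When the index is written, its header cell is empty, so `pd.read_csv()` labels that column "Unnamed: 0" on the way back in. Writing with `index=False` avoids the extra column.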

