# Working with Files in Python (.txt, .json, .csv)

Most datasets you will encounter come as text files, JSON files, or CSV files. You need to know how to open each of these file types, load the contents into memory, and, once you are done working with the data, store the results in (another) file. Writing to a file is one of the simplest ways to save data. When you write text to a file, the output is still available after you close the terminal containing your program’s output. You can examine the output after a program finishes running, and you can share the output files with others. You can also write programs that read the text back into memory and work with it again later.

## Reading from a .txt File
To begin with, we need a file with a few lines of text in it. Let’s start with a file that contains pi to 30 decimal places with 10 decimal places per line:

In [None]:
filename = "data/pi_30_digits.txt"

with open(filename) as file:
    contents = file.read()

print(contents)

Let’s start by looking at the `open()` function. To do any work with a file, even just printing its contents, you first need to open it. The `open()` function needs one argument: the name of the file you want to open. If you pass a bare filename without a path, Python looks for the file in the current working directory, which is usually the folder you started the program or notebook from. Here we pass a relative path, `data/pi_30_digits.txt`, so Python looks for the file in the `data` folder inside the current working directory.

The `open()` function returns an object representing the file. Here, `open(filename)` returns an object representing `data/pi_30_digits.txt`. The `as file` part of the `with` statement stores this object in the variable `file`, which we use inside the block.

The keyword `with` starts a block managed by a context manager, which wraps a block of code and performs an action at the end of the block, no matter how the block exits. In this case, it closes the file once access to it is no longer needed. Notice how we call `open()` in this program but not `close()`. You could open and close the file by calling `open()` and `close()` yourself, but if a bug in your program prevents the `close()` call from being executed, the file may never be closed. This may seem trivial, but improperly closed files can cause data to be lost or corrupted. And if you call `close()` too early in your program, you’ll find yourself trying to work with a closed file (a file you can’t access), which leads to more errors. It is not always easy to know exactly when you should close a file, but with the structure shown here, Python handles that for you: the file is closed as soon as execution leaves the `with` block.
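To make this concrete, here is a rough sketch of what the `with` statement spares you from writing by hand. The temporary file and the names `sample_path`, `manual_file`, and `managed_file` are illustrative assumptions, not part of the example above.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w") as f:
    f.write("3.1415926535\n")

# Manual version: close() sits in a finally clause so that it runs
# even if reading the file raises an exception.
manual_file = open(sample_path)
try:
    manual_contents = manual_file.read()
finally:
    manual_file.close()

# Equivalent version using a with block.
with open(sample_path) as managed_file:
    managed_contents = managed_file.read()

# Both file objects are closed at this point.
print(manual_file.closed, managed_file.closed)
```

Both versions read the same text; the `with` version is shorter and harder to get wrong.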

Once we have a file object representing `data/pi_30_digits.txt`, we use the `read()` method inside the `with` block to read the entire contents of the file and store it as one long string in `contents`. When we print the value of `contents`, we get the entire text file back.

The only difference between this output and the original file is the extra blank line at the end of the output. The blank line appears because the last line of the file ends with a newline character, which `read()` returns as part of the string, and `print()` then adds a newline of its own. If you want to remove the extra blank line, call `rstrip()` on the string, for example `print(contents.rstrip())`.
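A quick way to see where the blank line comes from, and how `rstrip()` removes it, is to inspect the end of the string. This is a minimal sketch that writes its own stand-in for `data/pi_30_digits.txt` to a temporary folder; the file layout (three lines, each ending in a newline) is an assumption.

```python
import os
import tempfile

# Hypothetical stand-in for data/pi_30_digits.txt.
sample_path = os.path.join(tempfile.mkdtemp(), "pi_30_digits.txt")
with open(sample_path, "w") as f:
    f.write("3.1415926535\n  8979323846\n  2643383279\n")

with open(sample_path) as file:
    contents = file.read()

# repr() makes the trailing newline character visible.
print(repr(contents[-5:]))

# rstrip() removes trailing whitespace, including that newline.
print(contents.rstrip())
```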

## File paths

When you pass a simple filename like `pi_30_digits.txt` to the `open()` function, Python looks in the current working directory. If you run a .py program from the folder it is stored in, or work in a notebook, that is usually the folder containing your program or notebook file.

Sometimes, depending on how you organize your work, the file you want to open is not located in that directory. In that case, you can pass a relative or an absolute file path to the `open()` function.

A **relative file path** tells Python to look for a given location relative to the current working directory. On Linux and macOS, you would write:

In [None]:
with open('data/filename.txt') as file_object:
    pass

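This tells Python to look for `filename.txt` in the folder `data`, which is assumed to be located inside your current working directory. On **Windows** systems, paths are conventionally written with a backslash (\) instead of a forward slash (/). Because the backslash starts an escape sequence in Python string literals (`'\f'` is a form feed, for example), write the path as a raw string with an `r` prefix or double each backslash. Forward slashes also work with `open()` on Windows.

In [None]:
# Raw string: the r prefix keeps '\f' from being read as a form feed
with open(r'data\filename.txt') as file_object:
    pass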

An **absolute file path** tells the Python interpreter exactly where a file is located, regardless of the current working directory. On Linux and macOS, absolute paths look like this:

In [None]:
file_path = '/tmp/text_files/filename.txt' 
with open(file_path) as file_object:
    pass

and on **Windows** they look like this:

In [None]:
# Raw string: without the r prefix, '\U' is read as a broken Unicode escape
file_path = r'C:\Users\<username>\AppData\Local\Temp\text_files\filename.txt'
with open(file_path) as file_object:
    pass
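If you would rather not think about slashes at all, the standard library's `pathlib` module builds paths with the separator that fits the current operating system. The following is a minimal sketch; the folder and file names are the placeholders used above, and nothing is opened here.

```python
from pathlib import Path

# Build a relative path from its parts; pathlib inserts the
# separator that matches the operating system.
relative_path = Path("data") / "filename.txt"
print(relative_path)

# Joining it onto the current working directory gives an absolute path.
absolute_path = Path.cwd() / relative_path
print(absolute_path.is_absolute())

# A Path object can be passed to open() just like a string, e.g.
# with open(relative_path) as file_object: ...
```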

## Reading Line by Line

When you are reading a file, you will often want to examine each line of the file. You might be looking for certain information in the file, or you might want to modify the text in the file in some way. 

You can use a `for` loop on the file object to examine each line from a file one at a time.

In [None]:
with open('data/pi_30_digits.txt') as file_object:
    for line in file_object:
        print(line.rstrip())

## Making a List of Lines from a File

When you use `with`, the file object returned by `open()` is only open inside the `with` block that contains it. If you want to retain access to a file’s contents outside the `with` block, you can store the file’s lines in a list inside the block and then work with that list.

The following example stores the lines of `pi_30_digits.txt` in a list inside the with block and then prints the lines outside the with block.

In [None]:
with open('data/pi_30_digits.txt') as file_object:
    lines = file_object.readlines()

for line in lines:
    print(line.rstrip())

## Working with a File’s Contents

After you have read a file into memory, you can do whatever you want with that data, so let’s briefly explore the digits of pi. First, we’ll attempt to build a single string containing all the digits in the file with no whitespace in it.
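One way to build that string is sketched below. It assumes the file holds three lines of digits, with the second and third lines indented, and writes such a file to a temporary folder so that the example runs on its own; `pi_string` is an illustrative name.

```python
import os
import tempfile

# Hypothetical stand-in for data/pi_30_digits.txt.
sample_path = os.path.join(tempfile.mkdtemp(), "pi_30_digits.txt")
with open(sample_path, "w") as f:
    f.write("3.1415926535\n  8979323846\n  2643383279\n")

with open(sample_path) as file_object:
    lines = file_object.readlines()

pi_string = ""
for line in lines:
    # strip() removes the newline at the end of each line and any
    # spaces at the start of it.
    pi_string += line.strip()

print(pi_string)
print(len(pi_string))
```

The resulting string is 32 characters long: the leading `3`, the decimal point, and 30 decimal places.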

**Note:** When Python reads from a text file, it interprets all text in the file as a string. If you read in a number and want to work with that value in a numerical context, you have to convert it to an integer with the `int()` function or to a float with the `float()` function.
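A short sketch of that conversion, using the digits of pi as a string literal instead of reading them from a file; the variable names are illustrative.

```python
pi_string = "3.141592653589793238462643383279"

# A string cannot take part in arithmetic until it is converted.
pi_value = float(pi_string)
print(pi_value * 2)

# int() works the same way for whole numbers stored as text.
digit_count = int("30")
print(digit_count + 2)
```

Note that a float keeps only about 15 to 17 significant digits, so `pi_value` is less precise than the text it came from.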

# Writing to a File



## Writing to an Empty File

To write text to a file, you need to call `open()` with a second argument telling Python that you want to write to the file. To see how this works, let’s write a simple message and store it in a file instead of printing it to the screen.

The call to `open()` in the following example has two arguments. The first argument is still the name of the file we want to open. The second argument, `'w'`, tells Python that we want to open the file in write mode. You can open a file in *read mode* (`'r'`), *write mode* (`'w'`), *append mode* (`'a'`), or a mode that allows you to *read and write* to the file (`'r+'`). If you omit the mode argument, Python opens the file in read-only mode by default.

The `open()` function automatically creates the file you are writing to if it does not already exist. However, be careful when opening a file in write mode (`'w'`): if the file does exist, Python erases its contents before returning the file object.

In [None]:
filename = 'data/msg.txt'

with open(filename, 'w') as file_object:
    file_object.write(contents)

In [None]:
%%bash
cat data/msg.txt
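One detail worth knowing when you write several lines: `write()` does not add a newline for you. The sketch below writes two lines to a temporary file (the path and the messages are made up for illustration) and reads them back.

```python
import os
import tempfile

# Temporary location so the example does not touch data/msg.txt.
msg_path = os.path.join(tempfile.mkdtemp(), "msg.txt")

with open(msg_path, "w") as file_object:
    # Each call needs its own "\n"; otherwise the two messages
    # would end up on the same line.
    file_object.write("I love programming.\n")
    file_object.write("I love creating new games.\n")

with open(msg_path) as file_object:
    written = file_object.read()

print(written)
```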

## Appending to a File

If you want to add content to a file instead of writing over existing content, you can open the file in append mode. When you open a file in append mode, Python does not erase the file before returning the file object. Any lines you write to the file will be added at the end of the file. If the file does not exist yet, Python will create an empty file for you.

In [None]:
filename = 'data/msg.txt'

with open(filename, 'a') as file_object:
    file_object.write(contents)
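The difference between the two modes is easy to check on a scratch file. This is a minimal sketch with a made-up path and made-up text.

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "log.txt")

# 'w' creates the file and writes the first line.
with open(log_path, "w") as file_object:
    file_object.write("first line\n")

# 'a' keeps the existing content and adds to the end.
with open(log_path, "a") as file_object:
    file_object.write("second line\n")

with open(log_path) as file_object:
    after_append = file_object.read()

# Opening with 'w' again discards everything written so far.
with open(log_path, "w") as file_object:
    file_object.write("replaced\n")

with open(log_path) as file_object:
    after_write = file_object.read()

print(after_append)
print(after_write)
```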

# Storing Data in JSON files


A simple way to persist and exchange machine-readable data is to use the `json` module. JSON stands for JavaScript Object Notation.

The `json` module allows you to dump simple Python data structures into a file and load the data from that file the next time the program runs. You can also use `json` to share data between different Python programs. Even better, the JSON data format is not specific to Python, so you can share data you store in the JSON format with people who work in many other programming languages. It is a useful and portable format, and it is easy to learn.

## Using json.dump() and json.load()

Let’s write a short program that stores a list of numbers and another program that reads these numbers back into memory. The first program uses `json.dump()` to store the list, and the second program uses `json.load()`.

The `json.dump()` function takes two arguments: a piece of data to store and a file object it can use to store the data. Here’s how you can use `json.dump()` to store a list of numbers:

In [None]:
import json


numbers = list(range(10, 20, 2))
filename = 'data/numbers.json'

with open(filename, 'w') as f_obj:
    json.dump(numbers, f_obj)

Now we will write a program that uses `json.load()` to read the list back into memory.

In [None]:
import json


filename = 'data/numbers.json'

# open the file in read mode
with open(filename) as f_obj: 
    de_numbers = json.load(f_obj)

de_numbers

The same two calls work for nested data such as dictionaries that contain lists. The following cell wraps them in two small helper functions and stores an example record:

In [None]:
import json


def save_data(path, data):
    with open(path, 'w') as f_obj:
        json.dump(data, f_obj)
    

def read_data(path):
    with open(path) as f_obj: 
        content = json.load(f_obj)
    return content

    
# Some example data taken from: https://www.learningcontainer.com/json-example/
example_data = {
    "firstName": "Viola",
    "lastName": "Jacson",
    "gender": "woman",
    "age": 24,
    "address": {
        "streetAddress": "Udhna",
        "city": "San Jone",
        "state": "CA",
        "postalCode": "95221"
    },
    "phoneNumbers": [
        { "type": "home", "number": "27627" }
    ]
}

filename = 'data/viola_data.json'
save_data(filename, example_data)

In [None]:
read_data(filename)
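By default `json.dump()` writes everything on one line. If you want a file that is easier to read in a text editor, you can pass the `indent` argument. The sketch below uses a shortened version of the record above and a temporary path, both chosen for illustration.

```python
import json
import os
import tempfile

# Shortened, illustrative version of the example record.
record = {
    "firstName": "Viola",
    "age": 24,
    "phoneNumbers": [{"type": "home", "number": "27627"}],
}

json_path = os.path.join(tempfile.mkdtemp(), "viola_data.json")

with open(json_path, "w") as f_obj:
    # indent=4 spreads the output over several indented lines.
    json.dump(record, f_obj, indent=4)

with open(json_path) as f_obj:
    restored = json.load(f_obj)

# The data survives the round trip unchanged.
print(restored == record)
print(restored["phoneNumbers"][0]["number"])
```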

## The CSV File Format

One simple way to store data in a text file is to write the data as a series of values separated by commas, called comma-separated values. The resulting files are called CSV files.
For example, here are two lines of a famous dataset with information about the passengers of the Titanic:

```csv
1161,0,3,"Pokrnic, Mr. Mate",male,17,0,0,315095,8.6625,,S
1162,0,1,"McCaffry, Mr. Thomas Francis",male,46,0,0,13050,75.2417,C6,C
```


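This dataset from [Kaggle](https://www.kaggle.com/brendan45774/test-file) gives, for each passenger, the passenger ID, whether they survived, the class they traveled in, their name, sex, and age, the number of siblings or spouses aboard (SibSp), the number of parents or children aboard (Parch), the ticket number, the fare, the cabin, and finally the port where they embarked.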


CSV files are simple. For example, CSV files
  * Do not have types for their values—everything is a string
  * Do not have settings for font size or color
  * Do not have multiple worksheets
  * Cannot specify cell widths and heights
  * Cannot have merged cells
  * Cannot have images or charts embedded in them
  
The advantage of CSV files is simplicity. CSV files are widely supported by many types of programs, can be viewed in text editors, and are a straightforward way to represent spreadsheet data. The CSV format is exactly as advertised: It is just a text file of comma-separated values.

**Note:** Since CSV files are just text files, you might be tempted to read them in as a string and then process that string using the techniques you learned above. For example, since each cell in a CSV file is separated by a comma, maybe you could just call the `split()` method on each line of text to get the values. But not every comma in a CSV file represents the boundary between two cells. CSV files also have their own quoting and escaping rules that allow commas and other characters to be included as part of the values. The `split()` method does not handle these rules. Because of these potential pitfalls, you should always use the `csv` module for reading and writing CSV files.
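The pitfall shows up on the first sample line quoted above, because the passenger name contains a comma. A minimal sketch that parses the line from memory (using `io.StringIO` in place of a real file is an assumption made to keep the example self-contained):

```python
import csv
import io

line = '1161,0,3,"Pokrnic, Mr. Mate",male,17,0,0,315095,8.6625,,S\n'

# Naive approach: split on every comma, including the one in the name.
naive_fields = line.rstrip("\n").split(",")

# csv.reader understands that the quoted name is a single value.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields[3])
print(len(parsed_fields), parsed_fields[3])
```

`split()` yields 13 pieces and cuts the name in two, while `csv.reader` returns the expected 12 values.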

### Parsing the CSV File Headers
Python’s `csv` module in the standard library parses the lines in a CSV file and allows us to quickly extract the values we are interested in. Let’s start by examining the first line of the file, which contains a series of headers for the data.

In [None]:
import csv

filename = 'data/titanic.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    print(header_row)

### Reading Data from Reader Objects in a `for` Loop

To read data from a CSV file with the `csv` module, you need to create a Reader object, as in the second line of the following cell. A Reader object lets you iterate over the rows of the CSV file. For large CSV files, you will want to use the Reader object in a `for` loop, so that only one row is held in memory at a time.

In [None]:
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    for row in reader:
        print('Row #' + str(reader.line_num) + ' ' + str(row))

### Extracting and Reading Data

Now that we know which columns of data we need, let’s read in some of that data.

We make an empty set called `ages` and then loop through the remaining rows in the file. The reader object continues from where it left off in the CSV file and automatically returns each line following its current position. Because we have already read the header row, the loop begins at the second line, where the actual data starts. On each pass through the loop, we add the value at index 5, the sixth column, which stores the age, to the set.

In [None]:
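ages = set()

with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    for row in reader:
        # Note: convert to a number, otherwise we would store strings!
        # float() also accepts ages that are not whole numbers, and
        # rows with a missing (empty) age are skipped.
        if row[5]:
            ages.add(float(row[5]))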
        
print(sorted(ages))
print(max(ages))
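Counting column positions by hand is error-prone. As an alternative sketch, `csv.DictReader` uses the header row as keys, so each value can be looked up by name. The header names below are assumed to match the Kaggle file, and the two data rows are the sample lines quoted earlier, read from memory instead of from `data/titanic.csv`.

```python
import csv
import io

# Assumed header row followed by the two sample lines from above.
sample_csv = (
    "PassengerId,Survived,Pclass,Name,Sex,Age,"
    "SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1161,0,3,"Pokrnic, Mr. Mate",male,17,0,0,315095,8.6625,,S\n'
    '1162,0,1,"McCaffry, Mr. Thomas Francis",male,46,0,0,13050,75.2417,C6,C\n'
)

ages_by_name = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    # Each row is a dictionary keyed by the header names.
    if row["Age"]:
        ages_by_name[row["Name"]] = float(row["Age"])

print(ages_by_name)
```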

### Writing Data to CSV Files

A Writer object lets you write data to a CSV file. To create a Writer object, you use the `csv.writer()` function.

First, call `open()` and pass it `'w'` to open a file in write mode. This creates the file object you can then pass to `csv.writer()` to create a Writer object.

Also pass an empty string for the `open()` function’s `newline` keyword argument. The `csv` module does its own newline handling, and its documentation recommends `newline=''` whenever you open a file for a reader or writer. If you forget it, the rows in `data/output.csv` will be double-spaced on Windows.

The `writerow()` method for Writer objects takes a list argument. Each value in the list is placed in its own cell in the output CSV file. The return value of `writerow()` is the number of characters written to the file for that row (including newline characters). Notice how the Writer object automatically escapes the comma in the value `'614,5'` with double quotes in the CSV file. The `csv` module saves you from having to handle these special cases yourself.

In [None]:
import csv

    
with open('data/output.csv', 'w', newline='') as output_file:
    output_writer = csv.writer(output_file)
    
    output_writer.writerow(['2015', '1', '0', '5100', '614,5'])
    output_writer.writerow(['2015', '1', '0', '5104', '2,3'])
    output_writer.writerow(['2015', '1', '0', '5106', '1'])
    output_writer.writerow(['2015', '1', '0', '5110', '1'])

In [None]:
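# A second Writer that uses a semicolon as delimiter and '|' as quote character.
# newline='' is passed here as well, as recommended for csv files.
with open('/tmp/output.csv', 'w', newline='') as output_file:
    output_writer = csv.writer(output_file, delimiter=';', quotechar='|')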
    output_writer.writerow(['2015', '1', '0', '5100', '614\t5'])
    output_writer.writerow(['2015', '1', '0', '5104', '2,3'])
    output_writer.writerow(['2015', '1', '0', '5106', '1'])
    output_writer.writerow(['2015', '1', '0', '5110', '1'])
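A file written with a custom delimiter and quote character has to be read back with the same settings. The sketch below writes two rows to a temporary file and reads them again; the path and the value `'614;5'`, chosen so that quoting actually kicks in, are assumptions for illustration.

```python
import csv
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
rows = [
    ["2015", "1", "0", "5100", "614;5"],  # contains the delimiter
    ["2015", "1", "0", "5104", "2,3"],    # a comma is harmless here
]

with open(out_path, "w", newline="") as output_file:
    writer = csv.writer(output_file, delimiter=";", quotechar="|")
    writer.writerows(rows)

# Look at the raw text to see how the value with a semicolon is quoted.
with open(out_path, newline="") as input_file:
    raw_text = input_file.read()

# Read the rows back with the same delimiter and quote character.
with open(out_path, newline="") as input_file:
    reader = csv.reader(input_file, delimiter=";", quotechar="|")
    read_back = list(reader)

print(raw_text)
print(read_back == rows)
```

Only the value that contains the delimiter is wrapped in `|` characters; the value with a comma is written as is, because a comma has no special meaning in this file.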