# Reading and writing files, JSON

## Contents:

* File Input/Output
* Working with directories of files

## File Input/Output

Much of our input data comes from files stored on our computer (on the file system). When we work with these files in Python, the analysis happens in memory, so we have to write the results back to the file system to keep them. Reading and writing files is therefore an essential programming skill.

Until now, we have run code (almost instantly) in our Jupyter Notebooks. Now imagine code that takes a couple of hours to run on a large collection of files. We then want to save the result, either for further analysis or to make the files available to others (i.e. sharing them) as part of our research.

### Opening a file without the `with` statement

The following code opens a file in our file system, prints the first 10 lines, and closes the file. Please note that this file must exist in your Colab session (when running on Colab) or on your computer (when running locally).

If you are working locally and have only downloaded this notebook, go back to the repository and download the file to the appropriate path (or change the path below).

**Please note:** The code below shows you how the `open()` function works. It's better to use a `with` block (see below), which does this opening and closing for you.

In [None]:
infile = open('data/adams-hhgttg.txt', 'r', encoding='utf-8')
# If you have put the file somewhere else, such as on your Drive, modify the path passed to open(),
# for example to '/drive/data/adams-hhgttg.txt'. Otherwise, you will get a FileNotFoundError when running this code.

for i, line in enumerate(infile):
    if i == 10:
        break
    print(line)

infile.close()

The key step here is the call to `open()`, which opens a file and returns a **file object** (hint: try printing the type of `infile`). It is commonly used with the following three parameters: the **name of the file** that we want to open, the **mode**, and the **encoding**.

- **filename**: the name of the file to open, given as a full (absolute) path or as a path relative to the notebook.

- **mode**: the mode in which we want to open the file. The most commonly used values are `r` for **reading** (the default, so you don't have to specify it explicitly), `w` for **writing** (overwriting existing files), and `a` for **appending**. (Note that [the documentation](https://docs.python.org/3/library/functions.html#open) lists further mode values that may be necessary in some exceptional cases.)

- **encoding**: the character encoding used to convert between the bytes stored on disk and the text (strings) in Python, for example `utf-8`. More on this later.
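To make the three modes concrete, here is a minimal sketch. It assumes a throwaway file in a temporary folder (`demo_path` is a hypothetical name, not part of the course data), so nothing on your system is overwritten.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary folder, so no real data is touched.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")

# 'w' creates the file, or overwrites it if it already exists.
outfile = open(demo_path, "w", encoding="utf-8")
outfile.write("first line\n")
outfile.close()

# 'a' keeps the existing content and adds new text at the end.
outfile = open(demo_path, "a", encoding="utf-8")
outfile.write("second line\n")
outfile.close()

# 'r' (the default) opens the file for reading.
infile = open(demo_path, "r", encoding="utf-8")
content = infile.read()
infile.close()

print(content)
```

If you opened the file with `w` a second time instead of `a`, the first line would be lost.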

>**IMPORTANT**: every opened file should be **closed** by calling its `close()` method before the end of the program. Otherwise, the file may remain unavailable for later operations or to other programs.
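A `with` block takes care of closing for you: the file is closed as soon as the block ends, even if an error occurs inside it. The sketch below assumes a small temporary file (the name `demo_path` and its contents are made up for illustration) and repeats the "first 10 lines" loop from above.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this example.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write 20 short lines; the file is closed when the block ends.
with open(demo_path, "w", encoding="utf-8") as outfile:
    for number in range(20):
        outfile.write(f"This is line {number}\n")

# Read only the first 10 lines, again without calling close() ourselves.
first_lines = []
with open(demo_path, encoding="utf-8") as infile:
    for i, line in enumerate(infile):
        if i == 10:
            break
        first_lines.append(line.rstrip("\n"))

print(first_lines[:3])
print(infile.closed)  # The file object knows it has been closed.
```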

There are other ways to read a text file, including the methods `read()` and `readlines()`, which would simplify the code above to:

```python
infile = open('data/adams-hhgttg.txt', 'r', encoding='utf-8')
text = infile.readlines()
print(text[:10])
infile.close()
```

However, these methods **read the whole file at once**, which can cause memory and performance problems when working with large corpora.

In the solution we adopt here, the input file is read line by line, so that at any given moment **only one line of text** is loaded into memory.
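As a rough illustration of the difference, the sketch below counts lines and words while iterating over the file, keeping only running totals, and then compares the line count with `readlines()`. The temporary file and the names `demo_path`, `n_lines`, and `n_words` are assumptions made for this example.

```python
import os
import tempfile

# Hypothetical scratch file with 1000 short lines.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as outfile:
    for number in range(1000):
        outfile.write(f"line number {number}\n")

# Line by line: we only keep two counters, never the whole text.
n_lines = 0
n_words = 0
with open(demo_path, encoding="utf-8") as infile:
    for line in infile:
        n_lines += 1
        n_words += len(line.split())

# All at once: every line is stored in a list in memory.
with open(demo_path, encoding="utf-8") as infile:
    all_lines = infile.readlines()

print(n_lines, n_words, len(all_lines))
```

Both approaches see the same lines. The difference is only in how much of the file is held in memory at the same time.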

You can see all file object methods, including examples, on this W3schools page: https://www.w3schools.com/python/python_ref_file.asp

## Looping through folders and files: os.walk

If you want to load multiple files from a folder without explicitly providing the path to each file, you can point to the folder instead. We can use the built-in `os` module to loop through a folder and load multiple files into memory.

In [None]:
import os  # You only have to do this once in your code. 
           # Always put this at the top of your file.

In [None]:
list(os.walk("data/gutenberg-extension"))

In [None]:
gutenberg_books = dict()  # Create an empty dictionary to store our data in

for root, dirs, files in os.walk("data/gutenberg-extension"):
    for file in files:
        
        if not file.endswith('.txt'):  # Why this?
            continue
        
        # You have to specify the full (relative) path, not only the file name.
        file_path = os.path.join(root, file)  
        
        with open(file_path, encoding='utf-8') as infile:
            gutenberg_books[file] = infile.read()

In [None]:
gutenberg_books.keys()

The `os.walk()` function is convenient if you are dealing with a combination of files and folders, no matter how deep the hierarchy goes (folders in folders etc.). A simpler function that we saw in the main notebook is `os.listdir()`.
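The difference between the two can be sketched on a small, made-up folder structure. The temporary folder and the file names (`a.txt`, `notes.md`, `subfolder/b.txt`) are assumptions for this example only.

```python
import os
import tempfile

# Build a hypothetical folder with one subfolder and three files.
root_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(root_dir, "subfolder"))
for relative in ["a.txt", "notes.md", os.path.join("subfolder", "b.txt")]:
    with open(os.path.join(root_dir, relative), "w", encoding="utf-8") as outfile:
        outfile.write("example\n")

# os.listdir() only looks at the top level and also returns folder names.
top_level = sorted(os.listdir(root_dir))
print(top_level)

# os.walk() also visits the subfolder; we keep only the .txt files.
txt_files = []
for root, dirs, files in os.walk(root_dir):
    for file in files:
        if file.endswith(".txt"):
            txt_files.append(file)
txt_files = sorted(txt_files)
print(txt_files)
```

Note that `os.listdir()` does not find `b.txt`, because that file sits one level deeper.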

### Quiz

Extend your code from the previous quiz to run on a whole directory of files using `os.walk`. Instead of calling your file statistics function for a single file, write code that loops through all text files in the *data* directory and runs your file statistics function on each file in the directory.


In [None]:
import nltk
from collections import Counter

folder_path = 'data/gutenberg-extension'

#Your code here

---