# File Input/Output (I/O)

**Learning Objectives**:

- Read / write files using base Python and the `pandas` package.
- Use the `os` package to loop through multiple files.

****

We will often need to access data stored in external files. To work with that data, we first need to **read** it into Python. Then, after an analysis, we might need to **write** new files that contain the outputs. These two tasks, file input and output, are closely related, and there are a variety of ways to complete them in Python.

## Importing a Single File 

### Opening Files with Base Python

The base Python way of importing files uses three basic steps:

1. Opening the file
2. Reading the file
3. Closing the file

Let's open the file below. What type is the `text` variable? What additional processing is needed to parse this file?

In [None]:
my_file = open("capitals.csv", "r")
text = my_file.read()
my_file.close()

print(text)
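
A common alternative to the explicit open/read/close sequence is a `with` block, which closes the file automatically even if an error occurs while reading. The sketch below is self-contained: it writes a small hypothetical file, `demo_capitals.csv` (the name and contents are made up for illustration and are not the course data), into a temporary directory, reads it back as a single string, and then parses it with the standard-library `csv` module.

```python
import csv
import os
import tempfile

# Hypothetical sample data, used only for this illustration.
sample = "country,capital\nFrance,Paris\nJapan,Tokyo\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_capitals.csv")

    # Write the sample so there is something to read back.
    with open(path, "w", newline="") as f:
        f.write(sample)

    # Reading: the file is closed automatically when the block ends.
    with open(path, "r", newline="") as f:
        text = f.read()

    # Parsing: csv.reader splits each line into a list of strings.
    with open(path, "r", newline="") as f:
        rows = list(csv.reader(f))

print(type(text))
print(rows)
```

Here `text` is one long string, while `rows` is a list of lists of strings, which is much closer to a table.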

### Opening Files with `pandas`

More commonly, we will use a package that does much of the parsing for us. The `pandas` package has a rich set of tools to help import and parse files. For example, the `pandas.read_csv()` function is particularly useful here. The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) shows the many arguments that can be used to customize `read_csv()`. In this case, let's use the defaults for most arguments.

What additional processing would be necessary to parse this file? How many rows/columns are in the file?

In [None]:
import pandas as pd
pd.read_csv("capitals.csv")
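
As a rough illustration of what `read_csv()` returns, the sketch below parses a small made-up CSV held in memory (via `io.StringIO`, so no file on disk is assumed) and inspects the result. The column names and values are hypothetical and are not the contents of `capitals.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV contents, used only for this illustration.
csv_text = "country,capital\nFrance,Paris\nJapan,Tokyo\n"

# read_csv accepts a path or a file-like object; StringIO stands in for a file.
df = pd.read_csv(io.StringIO(csv_text))

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)
print(df.columns.tolist())
```

The first line of the CSV is treated as the header by default, so it becomes the column names rather than a row of data.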

## Filepaths

A **filepath** is the location of a file on your system. There are two kinds of filepaths:

* **absolute**: The full filepath, starting from the top-level (root) directory.
    * On macOS and Linux, these begin with a forward slash, e.g. `/Users/[USERNAME]/directory/subdirectory/file`.
    * On Windows, these begin with a backslash or, more commonly, a drive letter, e.g. `C:\Documents\directory\subdirectory\file`.
* **relative**: The filepath relative to the current working directory (for a notebook, this is usually the folder containing the notebook). Common locations include:
    * File in same folder: `./file` or `file` (`.` means 'here').
    * Subfolder: `subfolder/file`.
    * Higher folder: `../sisterfolder/file` (`..` means 'go up one level in the directory').

Either kind of path can be used to refer to a file, but the two look different in code.
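
One way to see the difference is to ask Python for the current working directory and to convert a relative path into an absolute one. This is a minimal sketch using the standard-library `os` module; the output depends on your machine, and the file itself does not have to exist for the conversion to work.

```python
import os

# The current working directory: the folder that relative paths start from.
cwd = os.getcwd()

relative_path = "capitals.csv"

# abspath resolves a relative path against the working directory.
# It only manipulates the path text, so the file need not exist.
absolute_path = os.path.abspath(relative_path)

print(cwd)
print(absolute_path)
print(os.path.isabs(relative_path), os.path.isabs(absolute_path))
```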

The example above used a relative filepath. Where was the file in relation to the current working directory?

What would the absolute filepath be on your machine?

In [None]:
# YOUR CODE HERE


## Challenge 1: Subfolder Access

There is a `capitals2.csv` file in the `data` subfolder of this folder. This is a common structure for projects, because the data is easy to locate with a relative path and the subfolder contains only the files that will be analyzed.

Each of the attempts below is incorrect and raises an error. What is the error? How would you properly access the file?

In [None]:
pd.read_csv('capitals2.csv')

In [None]:
pd.read_csv('../capitals2.csv')

In [None]:
pd.read_csv('/data/capitals2.csv')

## Iterating Through a Directory

The data subfolder actually has two files in it. Let's say that we want to loop through each of the files in the directory: 

``` 
for file in directory:
    do something
```

One way is to make a list of relative paths, loop through the paths, and perform the operation. That would look like: 

In [None]:
files = ['data/capitals.csv', 'data/capitals2.csv']

for file in files:
    print("Processing filename:", file)
    print(pd.read_csv(file).head())

This method works for a few items, but what if we were processing dozens or hundreds of files? Typing out every path would quickly get tedious. We can use the `os` package to automatically list the files in a directory.

In [None]:
import os

filelist = os.listdir('.')
filelist
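
Two details of `os.listdir()` are worth knowing before looping over its result: it returns bare file names rather than full paths, and the order of the names is not guaranteed. The sketch below demonstrates both on a temporary directory filled with a few empty, hypothetical files, so it does not depend on the contents of this folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create three empty, hypothetical files to list.
    for name in ["b.csv", "a.csv", "notes.txt"]:
        with open(os.path.join(tmp_dir, name), "w"):
            pass

    # listdir returns names only, in arbitrary order; sorted() makes it repeatable.
    names = sorted(os.listdir(tmp_dir))

print(names)
```

Because only names are returned, each one has to be combined with the directory again before the file can be opened.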

Another useful function from `os` is `os.path.join()`, which combines a directory and a file name into a single path, using the correct separator for your operating system.

In [None]:
base_path = '.'
file = '14_Statsmodels.ipynb'
joined = os.path.join(base_path, file)

In [None]:
print(base_path)
print(file)
print(joined)
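
`os.path.join()` also accepts more than two components, which helps when files sit several folders deep. In this small sketch the folder and file names are hypothetical; `os.sep` shows which separator your operating system uses, and `os.path.split()` undoes the last join.

```python
import os

# Hypothetical folder and file names, used only for illustration.
path = os.path.join("data", "raw", "capitals2.csv")

# os.sep is the path separator for the current operating system.
print(os.sep)
print(path)

# split reverses the last join: (directory part, file name).
directory, filename = os.path.split(path)
print(directory)
print(filename)
```

Building paths this way, rather than typing slashes by hand, keeps the same code working on Windows, macOS, and Linux.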

## Challenge 2: Reading in Multiple Files

Use `os.listdir()` to identify each file in the `data` subfolder. Read each `csv` file in the list using `pd.read_csv()` and append it to the variable `df_list`. 

**Hint**: Are there any non-csv files in the folder? What happens when you try to read them in with `read_csv()`? How can we account for them?

**Bonus**: Wrap this code in a function called `read_files()` that takes a directory as input and returns a list of `pd.DataFrame` objects.

In [None]:
df_list = []

# YOUR CODE HERE


## Combining Multiple File Inputs

We can combine the data frames using `pd.concat()`, which takes a list of `pd.DataFrame` objects and stacks them on top of each other:

In [None]:
result = pd.concat(df_list)
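
By default, `pd.concat()` keeps each data frame's original row labels, so the combined index can contain duplicates. The sketch below uses two small made-up data frames (stand-ins for the files read earlier, not the actual course data) to compare the default with `ignore_index=True`, which renumbers the rows.

```python
import pandas as pd

# Two small hypothetical data frames standing in for the files read earlier.
df_a = pd.DataFrame({"country": ["France", "Japan"], "capital": ["Paris", "Tokyo"]})
df_b = pd.DataFrame({"country": ["Kenya"], "capital": ["Nairobi"]})

# Default: original row labels are kept, so label 0 appears twice.
stacked = pd.concat([df_a, df_b])

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Whether to renumber depends on whether the original row labels carry meaning for your analysis.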

## Writing Files

A `pd.DataFrame` can be exported to a csv (or other filetype) using `df.to_csv()`. This is a method built into every data frame.

In [None]:
result.to_csv('result.csv')
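
One thing to be aware of is that `to_csv()` writes the row labels as an extra, unnamed first column unless told otherwise. The sketch below, which uses a made-up data frame and a temporary directory rather than the `result` from above, passes `index=False` to leave the labels out and then reads the file back to check that the columns survive the round trip.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, used only for this illustration.
df = pd.DataFrame({"country": ["France", "Japan"], "capital": ["Paris", "Tokyo"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demo_result.csv")

    # index=False omits the row labels from the output file.
    df.to_csv(out_path, index=False)

    # Read the file back to confirm what was written.
    round_trip = pd.read_csv(out_path)

print(round_trip.columns.tolist())
```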

Congratulations! We have now covered reading single and multiple files, combining them, and writing the results back to disk. These are the common strategies for getting data into and out of our Python environment, so the data is ready for analysis.