# Working with Files

**Learning Objectives**:
- Read / write files using base Python and the pandas package
- Use the `os` package to loop through multiple files


# Importing a single file 

## Using `open`

The base Python way of importing files uses three basic steps:

1. Opening the file
2. Reading the file
3. Closing the file


Let's open the file below. What type is the `text` variable? What additional processing is needed to parse this file?

In [9]:
my_file = open("capitals.csv", "r")
text = my_file.read()
my_file.close()

print(text)

Country,Capital,Latitude,Longitude
Afghanistan,Kabul,34¡28'N,69¡11'E
Albania,Tirane,41¡18'N,19¡49'E
Algeria,Algiers,36¡42'N,03¡08'E
American Samoa,Pago Pago,14¡16'S,170¡43'W
Andorra,Andorra la Vella,42¡31'N,01¡32'E
Angola,Luanda,08¡50'S,13¡15'E
Antigua and Barbuda,W. Indies,17¡20'N,61¡48'W
Argentina,Buenos Aires,36¡30'S,60¡00'W
Armenia,Yerevan,40¡10'N,44¡31'E
Aruba,Oranjestad,12¡32'N,70¡02'W
Australia,Canberra,35¡15'S,149¡08'E
Austria,Vienna,48¡12'N,16¡22'E
Azerbaijan,Baku,40¡29'N,49¡56'E
Bahamas,Nassau,25¡05'N,77¡20'W
Bahrain,Manama,26¡10'N,50¡30'E
Bangladesh,Dhaka,23¡43'N,90¡26'E
Barbados,Bridgetown,13¡05'N,59¡30'W
Belarus,Minsk,53¡52'N,27¡30'E
Belgium,Brussels,50¡51'N,04¡21'E
Belize,Belmopan,17¡18'N,88¡30'W
Benin,Porto-Novo (constitutional cotonou) (seat of gvnt),06¡23'N,02¡42'E
Bhutan,Thimphu,27¡31'N,89¡45'E
Bolivia,La Paz (adm.)/sucre (legislative),16¡20'S,68¡10'W
Bosnia and Herzegovina,Sarajevo,43¡52'N,18¡26'E
Botswana,Gaborone,24¡45'S,25¡57'E
Brazil,Brasilia,15¡47'S,47¡55'W
Brit
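The three steps above can also be written with a `with` block, which closes the file automatically. The sketch below is a minimal illustration, not the notebook's own solution: it writes a small hypothetical sample file (`sample_capitals.csv`, invented here so the code runs anywhere) and then uses the standard-library `csv` module as one possible way to split the raw string into rows.

```python
import csv
import os
import tempfile

# Hypothetical sample data so the sketch runs anywhere; not the real capitals.csv.
sample = "Country,Capital\nAlbania,Tirane\nAustria,Vienna\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample_capitals.csv")
    with open(path, "w", newline="") as f:
        f.write(sample)

    # `with` closes the file when the block ends, even if reading raises an error.
    with open(path, "r", newline="") as f:
        text = f.read()

    # read() returns the whole file as a single string.
    text_type = type(text).__name__

    # csv.reader splits each line into a list of fields.
    with open(path, "r", newline="") as f:
        rows = list(csv.reader(f))

print(text_type)
print(rows[0])   # header row
print(rows[1])   # first data row
```

Every field comes back as a string, so values such as latitude would still need converting before any numeric work.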

## Using `pandas`

More commonly, we will use a package that does most of the parsing for us. The `pandas` package has powerful tools for importing and parsing files, notably the `pandas.read_csv()` function. The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) lists the many arguments that can be used to customize `read_csv()`. In this case, let's use the defaults for all of them.


What additional processing would be necessary to parse this file? How many rows/columns are in the file?

In [10]:
import pandas as pd
pd.read_csv("capitals.csv")

Unnamed: 0,Country,Capital,Latitude,Longitude
0,Afghanistan,Kabul,34¡28'N,69¡11'E
1,Albania,Tirane,41¡18'N,19¡49'E
2,Algeria,Algiers,36¡42'N,03¡08'E
3,American Samoa,Pago Pago,14¡16'S,170¡43'W
4,Andorra,Andorra la Vella,42¡31'N,01¡32'E
...,...,...,...,...
195,Venezuela,Caracas,10¡30'N,66¡55'W
196,Viet Nam,Hanoi,21¡05'N,105¡55'E
197,Yugoslavia,Belgrade,44¡50'N,20¡37'E
198,Zambia,Lusaka,15¡28'S,28¡16'E
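One way to answer the rows/columns question in code is the DataFrame's `shape` attribute. The sketch below assumes a small hypothetical sample held in memory with `io.StringIO` instead of the real `capitals.csv`, so the numbers it prints are for the sample only.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for capitals.csv.
sample = io.StringIO(
    "Country,Capital\n"
    "Albania,Tirane\n"
    "Austria,Vienna\n"
    "Belgium,Brussels\n"
)

df = pd.read_csv(sample)

# shape is (rows, columns); the header line is not counted as a row.
n_rows, n_cols = df.shape
column_names = list(df.columns)

print(n_rows, n_cols)
print(column_names)
print(df.dtypes)   # shows which type pandas inferred for each column
```

Checking `dtypes` right after loading is a quick way to spot columns that were read as text and still need parsing.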


# Filepaths

There are two kinds of filepaths:
* **absolute**: The filepath from the top level directory
    * For Macs, this usually looks like '/Users/[username]/directory/subdirectory/subsubdirectory/file'
    * For Windows, this usually looks like: `C:\Users\[username]\directory\subdirectory\file`
* **relative**: The filepath relative to the current working directory (i.e. notebook location). Common locations are: 
    * File in same folder: './file' or 'file' (`.` means 'here')
    * Subfolder: 'subfolder/file'
    * Higher folder: '../sisterfolder/file' (`..` means 'go up one level in the directory')

Both kinds of paths work for file names, but they will look different in the code. 
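The relationship between the two kinds of path can be checked with the `os` package. This is a minimal sketch: `os.getcwd()` reports the current working directory, and `os.path.abspath()` resolves a relative path against it. The relative path used here is only an illustration, and the file does not need to exist for these calls to work.

```python
import os

# Absolute path of the current working directory.
cwd = os.getcwd()

# A relative path (illustrative; the file does not have to exist).
relative = os.path.join("data", "capitals2.csv")

# The same location expressed as an absolute path, resolved against cwd.
absolute = os.path.abspath(relative)

print(cwd)
print(relative, os.path.isabs(relative))
print(absolute, os.path.isabs(absolute))
```

The printed absolute path will differ from machine to machine, which is why relative paths are usually easier to share.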


The example above uses a relative filepath, since the file is in the same directory. Where was the file in relation to the current working directory?

What would the absolute filepath be on your machine?



In [20]:
## your code here

## Challenge 1: Subfolder access

There is a `capitals2.csv` file in the `data` subfolder of this folder. This is a common project structure: the subfolder is easy to reach with a relative path, and it contains only the files that will be analyzed.

Each of the attempts below is incorrect and gives an error. What is the error? How would you properly access the file?

In [22]:
pd.read_csv('capitals2.csv')

In [None]:
pd.read_csv('../capitals2.csv')

In [None]:
pd.read_csv('/data/capitals2.csv')

# Iterating through a directory

The data subfolder actually has two files in it. Let's say that we want to loop through each of the files in the directory. 

``` 
for file in directory:
    do something
```

One way is to make a list of relative paths, loop through the paths, and perform the operation. That would look like: 

In [26]:
files = ['data/capitals.csv','data/capitals2.csv']
for f in files:
    print("processing filename:",f)
    print(pd.read_csv(f).head())

processing filename: data/capitals.csv
          Country           Capital Latitude Longitude
0     Afghanistan             Kabul  34¡28'N   69¡11'E
1         Albania            Tirane  41¡18'N   19¡49'E
2         Algeria           Algiers  36¡42'N   03¡08'E
3  American Samoa         Pago Pago  14¡16'S  170¡43'W
4         Andorra  Andorra la Vella  42¡31'N   01¡32'E
processing filename: data/capitals2.csv
            Capital         Country Latitude Longitude
0             Kabul     Afghanistan  34¡28'N   69¡11'E
1            Tirane         Albania  41¡18'N   19¡49'E
2           Algiers         Algeria  36¡42'N   03¡08'E
3         Pago Pago  American Samoa  14¡16'S  170¡43'W
4  Andorra la Vella         Andorra  42¡31'N   01¡32'E


This method works for a few items, but it would quickly become tedious if we were processing dozens or hundreds of files. Instead, we can use the `os` package to list the files in a directory automatically.

In [None]:
import os

filelist = os.listdir('.')
filelist
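`os.listdir()` returns bare names (not full paths) in arbitrary order, and it includes subfolders as well as files. The sketch below shows this on a temporary directory filled with hypothetical placeholder files, so it does not depend on the contents of your own folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create hypothetical placeholder files to list.
    for name in ["capitals.csv", "capitals2.csv", "notes.txt"]:
        with open(os.path.join(tmpdir, name), "w") as f:
            f.write("placeholder\n")

    # listdir gives names only; sorting makes the order predictable.
    names = sorted(os.listdir(tmpdir))

    # The names must be joined to the directory before they can be opened.
    all_exist = all(os.path.isfile(os.path.join(tmpdir, n)) for n in names)

print(names)
print(all_exist)
```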

Another useful function from `os` is `os.path.join()`, which joins a directory and a filename into a single path, using the correct separator for the operating system.

In [39]:
print('dir:','.')
print('file:',filelist[0])
print('joined:',os.path.join('.',filelist[0]))

dir: .
file: 10_numpy.ipynb
joined: ./10_numpy.ipynb
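`os.path.join()` accepts more than two parts, and `os.path.split()` and `os.path.splitext()` take a path apart again. The following sketch uses illustrative path names only. The separator in the joined result depends on the operating system.

```python
import os

# Join a folder and a filename into one path.
joined = os.path.join("data", "capitals2.csv")

# Any number of parts can be joined in one call.
nested = os.path.join("..", "sisterfolder", "file.csv")

# Split a path back into its directory and filename.
directory, filename = os.path.split(joined)

# Split a filename into its stem and extension.
stem, extension = os.path.splitext(filename)

print(joined)
print(nested)
print(directory, filename)
print(stem, extension)
```

`os.path.splitext()` is a convenient way to test a file's extension, because it also behaves sensibly for names that contain no dot.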


## Challenge 2: Reading in multiple files

Use `os.listdir` to identify each file in the `data` subfolder. Read each csv file in the list using `pd.read_csv()` and append it to the variable `dflist`. 


**Hint**: Are there any non-csv files in the folder? What happens when you try to read them in with `read_csv()`? How can we account for them?

**Bonus**: Wrap this code in a function called `read_files()` that takes as input a directory and returns a list of DataFrames.

In [47]:
#student code

dflist=[]


##solution
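def read_files(directory):
    """Read every csv file in `directory` into a list of DataFrames."""
    dflist = []
    for f in os.listdir(directory):
        print(f)
        # skip non-csv entries such as .ipynb_checkpoints
        if f.endswith('.csv'):
            dflist.append(pd.read_csv(os.path.join(directory, f)))
    return dflist

dfs = read_files('data')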

capitals2.csv
capitals.csv
.ipynb_checkpoints
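As an alternative to filtering the output of `os.listdir()`, the standard-library `glob` module can match filenames against a pattern. This is a different technique from the solution above, sketched here on a temporary directory with hypothetical placeholder files.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical folder contents: two csv files, one text file, one subfolder.
    for name in ["capitals.csv", "capitals2.csv", "notes.txt"]:
        with open(os.path.join(tmpdir, name), "w") as f:
            f.write("placeholder\n")
    os.mkdir(os.path.join(tmpdir, ".ipynb_checkpoints"))

    # "*.csv" matches only names ending in .csv; glob returns full paths.
    pattern = os.path.join(tmpdir, "*.csv")
    csv_paths = sorted(glob.glob(pattern))
    csv_names = [os.path.basename(p) for p in csv_paths]

print(csv_names)
```

Because `glob.glob()` returns paths that already include the directory, each result can be passed to `pd.read_csv()` without a separate `os.path.join()` call.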


## Combining files

We can combine the DataFrames in the list by using `pd.concat`, which takes a list of DataFrames and stacks them on top of each other.

In [1]:
result = pd.concat(dfs)

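Two details of `pd.concat` are worth seeing on a small case: columns are matched by name, not by position, and each DataFrame keeps its own row labels unless `ignore_index=True` is passed. The sketch below uses two hypothetical one-row DataFrames whose columns are in different orders, similar to the two files in the `data` folder.

```python
import pandas as pd

# Hypothetical frames with the same columns in a different order.
df1 = pd.DataFrame({"Country": ["Albania"], "Capital": ["Tirane"]})
df2 = pd.DataFrame({"Capital": ["Vienna"], "Country": ["Austria"]})

# Default behaviour: original row labels are kept, so 0 appears twice.
stacked = pd.concat([df1, df2])

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked)
print(renumbered)
```

Duplicate row labels are legal in pandas but can make later label-based selection confusing, so renumbering is often the safer choice when the original index carries no meaning.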

## Exporting files

A DataFrame can be exported to a csv file by using `df.to_csv()`. Other file types have their own methods, such as `df.to_excel()`.

In [None]:
result.to_csv('result.csv')
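By default, `to_csv()` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Passing `index=False` leaves it out. The sketch below demonstrates both cases on a hypothetical two-row DataFrame, writing to strings in memory so that no file is created.

```python
import io

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Country": ["Albania", "Austria"],
                   "Capital": ["Tirane", "Vienna"]})

# With no path argument, to_csv returns the csv text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Read both versions back to compare the resulting columns.
reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(list(reread_with.columns))
print(list(reread_without.columns))
```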

Congratulations! We have now covered loading single and multiple files, along with common strategies for getting data into our Python environment. This data is now ready to be used for analysis.