# Resources & references

- The built-in [`open()`](https://docs.python.org/3/library/functions.html#open) function.
- The [`io`](https://docs.python.org/3/library/io.html) and [`csv`](https://docs.python.org/3/library/csv.html) modules from the Python Standard Library.

# (Native) `io`

[`open()`](https://docs.python.org/3/library/functions.html#open) is a built-in Python function that returns a [**file object**](https://docs.python.org/3/glossary.html#term-file-object).
- A file object is not the contents of the file itself.  It is an object with methods such as `read()` or `write()` that allow interaction with the file.
- File objects are also called _file-like objects_ or **streams**.

## Using `with`

A file object is a context manager and can therefore be used with the [`with`](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement) statement.  In this example, `file` is closed once the `with` statement's suite finishes, even if an exception occurs:

```python
with open('spam.txt', 'w') as file:
    file.write('Spam and eggs!')
```

The alternative, without `with`, would be the following.  Note that here `file.close()` is skipped if `write()` raises:

```python
file = open('spam.txt', 'w')
file.write('Spam and eggs!')
file.close()
```
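
To make the comparison concrete, the `with` form behaves roughly like a `try`/`finally` block.  The sketch below is only an illustration: the temporary directory and the names `path`, `was_closed`, and `contents` are assumptions added so the example runs without touching your working directory.

```python
import os
import tempfile

# Work in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'spam.txt')

    file = open(path, 'w')
    try:
        file.write('Spam and eggs!')
    finally:
        # Runs whether or not write() raised, like the end of a `with` suite.
        file.close()

    was_closed = file.closed

    with open(path) as check:
        contents = check.read()

print(was_closed, contents)
```

In everyday code the `with` statement is shorter and harder to get wrong, which is why it is the usual choice.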

## File object methods

| Method/attr              | Description                                                                                                |
|--------------------------|:-----------------------------------------------------------------------------------------------------------|
| `read(size=-1)`          | Read up to `size` bytes from the object and return them.                                                   |
| `write(b)`               | Write the given bytes-like object, `b`, to the underlying raw stream, and return the number of bytes written. |
| `truncate(size=None)`    | Resize the stream to the given `size` in bytes (or the current position if `size` is not specified).       |
| `close()`                | Flush and close this stream.                                                                               |
| `readline(size=-1)`      | Read and return one line from the stream.                                                                  |
| `readlines(hint=-1)`     | Read and return a list of lines from the stream.                                                           |
| `writelines(lines)`      | Write a list of lines to the stream.                                                                       |
| `seek(offset[, whence])` | Change the stream position to the given byte `offset`.                                                     |
| `readable()`             | Return `True` if the stream can be read from. If `False`, `read()` will raise `OSError`.                   |
| `writable()`             | Return `True` if the stream supports writing. If `False`, `write()` and `truncate()` will raise `OSError`. |
| `closed`                 | `True` if stream is closed.                                                                                |
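
A few of these methods can be sketched together.  The example below assumes a binary-mode file (`'w+b'`), so that offsets really are byte counts as described in the table; the file name `demo.bin` and the variable names are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')

    # 'w+b' opens for binary writing and reading, truncating any existing file.
    with open(path, 'w+b') as f:
        f.writelines([b'one\n', b'two\n', b'three\n'])  # no newlines are added for you
        f.seek(0)                   # rewind before reading
        lines = f.readlines()       # list of lines, newlines included
        f.seek(4)                   # position just after b'one\n'
        new_size = f.truncate()     # cut the file at the current position
        f.seek(0)
        remaining = f.read()

    is_closed = f.closed            # True once the `with` suite has finished

print(lines, new_size, remaining, is_closed)
```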

### Reading and writing

Many of these are best illustrated with code.  We'll begin with an empty file `ex.txt`:

In [1]:
def print_file(f):
    res = open(f).read()
    if not res:
        res = '[File is empty]'
    print(res)

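Note that `print_file()` calls `open(f).read()` as a one-liner, without `with` or `close()`.  Because the file object is never bound to a name, CPython closes it as soon as it is garbage-collected.  This is an implementation detail rather than a guarantee (and may emit a `ResourceWarning`), so prefer `with` outside of quick interactive use.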

In [2]:
import os
os.chdir('./docs/tutorials/imgs')
os.getcwd()

'/Users/brad/Scripts/python/docs/tutorials/imgs'

In [4]:
# First make sure text file is empty - open in write mode to truncate
path = 'ex.txt'
f = open(path, 'w')
f.close()
    
print_file(path)  # Begin our example with an empty file

[File is empty]


Now let's open the file in read mode.  This is the default with `open()`; the syntax is:

```python
open(file, mode='r', buffering=-1, encoding=None, errors=None, 
     newline=None, closefd=True, opener=None)
```

Opening in read mode prohibits writing (and truncating):

In [5]:
f = open(path)  # default mode='r'
print('readable:', f.readable())
print('writable:', f.writable())

readable: True
writable: False


In [6]:
f.write('some text')  # This will raise
f.truncate()  # Will also raise

UnsupportedOperation: not writable

Similarly, in write mode, we cannot read:

In [7]:
f.close()
f = open(path, mode='w')
print('readable:', f.readable())
print('writable:', f.writable())

readable: False
writable: True


In [8]:
print(f.read())  # Will raise

UnsupportedOperation: not readable

`write()` writes to the file _and_ returns the number of characters written (in binary mode, the number of bytes):

In [9]:
f.write('First line of text')  # Returns number of characters written

18

Now let's close, re-open, and read:

In [10]:
f.close()
print_file(path)

First line of text


Opening in write mode **truncates the file automatically if it exists**:

In [11]:
f = open(path, mode='w')  # Truncates
f.close()
print_file(path)

[File is empty]
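
If truncation is not what you want, other modes avoid it.  The following sketch assumes a temporary file and shows append mode (`'a'`) and exclusive-creation mode (`'x'`); the variable names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'ex.txt')

    with open(path, 'w') as f:
        f.write('First line of text\n')

    # 'a' places every write at the end of the file; nothing is truncated.
    with open(path, 'a') as f:
        f.write('Second line of text\n')

    with open(path) as f:
        appended = f.read()

    # 'x' creates a new file and refuses to overwrite an existing one.
    try:
        open(path, 'x')
    except FileExistsError:
        exclusive_failed = True
    else:
        exclusive_failed = False

print(appended)
print('exclusive create failed:', exclusive_failed)
```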


### `seek()`

File objects are position-aware. `seek(offset)` changes the **stream position** to `offset` _and also returns the new absolute position_.

In [12]:
assert f.closed
print_file(path)
f = open(path, mode='w')
f.writelines([
    'This is the first line.\n',
    'This is line 2.\n',
    'There is a lot of fun to be had in here.',
])
f.close()
print_file(path)

[File is empty]
This is the first line.
This is line 2.
There is a lot of fun to be had in here.


`readline()` reads from the current stream position up to and including the next newline character.

In [13]:
f = open(path)
# Think of this like `next()`
print(f.readline())
print(f.readline())
f.seek(0)  # "Rewind"; also returns new absolute position
print(f.readline(), '[again]')

This is the first line.

This is line 2.

This is the first line.
 [again]
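
To see the return value of `seek()` alongside `tell()`, here is a small sketch that uses an in-memory `io.BytesIO` stream as a stand-in for a file opened in binary mode (an assumption made so the byte offsets are unambiguous).  In text mode, only `seek(0)` and offsets previously returned by `tell()` are portable.

```python
import io

# An in-memory binary stream stands in for a file opened with mode='rb'.
stream = io.BytesIO(b'This is the first line.\nThis is line 2.\n')

first = stream.readline()
position = stream.tell()                  # just past the first newline

new_pos = stream.seek(0)                  # rewind; returns the new position
from_end = stream.seek(-8, io.SEEK_END)   # 8 bytes before the end of the stream
tail = stream.read()

print(position, new_pos, from_end, tail)
```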


A file object also supports iteration, yielding one line at a time:

In [14]:
f.close()
with open(path) as f:
    for line in f:
        print(line)
assert f.closed

This is the first line.

This is line 2.

There is a lot of fun to be had in here.
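
The blank lines in that output appear because each line keeps its trailing newline and `print()` adds another.  A minimal sketch of one way to avoid this follows; `io.StringIO` stands in for the open file, and `cleaned` is an illustrative name.

```python
import io

# io.StringIO stands in for an open text file here.
f = io.StringIO(
    'This is the first line.\n'
    'This is line 2.\n'
    'There is a lot of fun to be had in here.'
)

cleaned = []
for number, line in enumerate(f, start=1):
    # Each line keeps its trailing '\n'; strip it before printing.
    cleaned.append(f'{number}: ' + line.rstrip('\n'))

print('\n'.join(cleaned))
```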


## Encoding

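`locale.getpreferredencoding()` returns the encoding used for text data on your system, according to user preferences.  `open()` falls back to this locale encoding when no `encoding` argument is given: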

In [15]:
import locale
print(locale.getpreferredencoding(False))

UTF-8
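
Because the default depends on the system, it is often safer to pass `encoding` explicitly.  The sketch below assumes UTF-8 and a temporary file; it also shows that character counts and byte counts can differ.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'menu.txt')

    with open(path, 'w', encoding='utf-8') as f:
        chars_written = f.write('caf\u00e9')   # 'café': 4 characters

    with open(path, 'rb') as f:
        raw = f.read()                         # the same text as raw bytes

    with open(path, encoding='utf-8') as f:
        decoded = f.read()

print(chars_written, len(raw), decoded)
```

The accented character occupies two bytes in UTF-8, so the file is one byte longer than the string is in characters.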


# `StringIO`

`StringIO` was a [standalone](https://docs.python.org/2/library/stringio.html) module in Python 2.  In Python 3, it was [merged](https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit) into `io`.  The `try/except` import for compatibility is:

In [16]:
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

In Python 3, use `io.StringIO` for text and `io.BytesIO` for binary data.

In [17]:
text = """Field1, Field2, Field3
1, 2, 3"""
print(text)
print()
print('raw:', repr(text))
s = StringIO(text)
print(type(s))

Field1, Field2, Field3
1, 2, 3

raw: 'Field1, Field2, Field3\n1, 2, 3'
<class '_io.StringIO'>
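
Both classes behave like the file objects described earlier, only in memory.  A short sketch, with illustrative variable names:

```python
import io

s = io.StringIO()
s.write('Field1,Field2\n')
s.write('1,2\n')
text_value = s.getvalue()      # the whole buffer, regardless of position

s.seek(0)                      # rewind, exactly as with a real file
first_line = s.readline()

b = io.BytesIO(b'\x00\x01\x02\x03')
first_two = b.read(2)          # read() counts bytes for a binary stream

print(repr(text_value), repr(first_line), first_two)
```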


# `csv`

## CSV formatting rules

[RFC 4180](https://tools.ietf.org/html/rfc4180.html) proposes a specification for the CSV format, and this is the definition commonly used. However, in popular usage "CSV" is not a single, well-defined format.  For instance, Excel (on Windows) has its own CSV format.

Consider 3 .csv files, rendered in Excel as follows:

<center>**_book1.csv_**:</center>

![book1](./imgs//book1.PNG)

<center>**_book2.csv_**:</center>

![book2](./imgs//book2.PNG)

<center>**_book3.csv_**:</center>

![book3](./imgs//book3.PNG)

In plain text files these would look like so:

In [19]:
def print_both(path):
    with open(path) as f:
        contents = f.read()
        print(repr(contents))
        print()
        print(contents)

_book1_:

In [20]:
print_both('book1.csv')

'\ufefftext'

﻿text


_book2_:

In [21]:
print_both('book2.csv')

'text,text\n'

text,text



_book3_:

In [22]:
print_both('book3.csv')

'text,text\na,b\n1,2\n'

text,text
a,b
1,2



This breaks down to the following structure:
```
field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
```
where `CRLF` is a line break and the header row is optional.
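
One detail from the _book1_ output above: the leading `'\ufeff'` is a byte order mark (BOM).  Assuming the file is UTF-8 with a BOM, opening it with the `utf-8-sig` codec strips the mark.  The sketch below recreates such a file in a temporary directory rather than relying on the real `book1.csv`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'book1.csv')

    # Recreate a file like book1.csv: a UTF-8 BOM followed by 'text'.
    with open(path, 'wb') as f:
        f.write(b'\xef\xbb\xbftext')

    with open(path, encoding='utf-8') as f:
        with_bom = f.read()        # the BOM survives as '\ufeff'

    with open(path, encoding='utf-8-sig') as f:
        without_bom = f.read()     # the BOM is consumed by the codec

print(repr(with_bom), repr(without_bom))
```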

## The `csv` module

The Standard Library's [`csv`](https://docs.python.org/3/library/csv.html) module is designed for reading and writing comma-separated value (CSV) files.   The csv module’s `reader` and `writer` objects read and write sequences. Programmers can also read and write data in dictionary form using the `DictReader` and `DictWriter` classes.

In [23]:
import csv
for obj in csv.__all__:
    if not any((obj.startswith('_'), obj.isupper())):
        print(obj)

Error
Dialect
excel
excel_tab
field_size_limit
reader
writer
register_dialect
get_dialect
list_dialects
Sniffer
unregister_dialect
DictReader
DictWriter
unix_dialect


**`csv.reader`** returns a reader object which will iterate over lines in the given `csvfile` argument.  This can be combined with a `StringIO` object.

In [24]:
reader = csv.reader(StringIO(text))
# reader.line_num == 0
for row in reader:
    print(row)

['Field1', ' Field2', ' Field3']
['1', ' 2', ' 3']
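
Notice the leading spaces in `' Field2'` and `' 2'`: by default, whitespace after the delimiter is kept as part of the field.  A minimal sketch of dropping it with the `skipinitialspace` formatting parameter (the name `sample` is illustrative and holds the same text as above):

```python
import csv
from io import StringIO

sample = """Field1, Field2, Field3
1, 2, 3"""

# skipinitialspace=True ignores whitespace immediately after each delimiter.
rows = list(csv.reader(StringIO(sample), skipinitialspace=True))
for row in rows:
    print(row)
```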


Note that if you pass the raw string itself, iteration occurs over each character.  (`csvfile` can be any object which supports the iterator protocol and returns a string each time its `__next__()` method is called.)

In [25]:
# Get first 5 elements
reader = csv.reader(text)
i = 0
while i < 5:
    print(next(reader))
    i += 1

['F']
['i']
['e']
['l']
['d']


So, the module does not support directly parsing strings, but sometimes you can just pass one within a list, because a list is a valid first argument to `csv.reader`:

In [26]:
for row in csv.reader(['one,two,three']):  # Can't have newlines here
    print(row)

['one', 'two', 'three']
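
For a string that does contain newlines, one option is to split it into lines first.  This is only a sketch: it assumes no quoted field contains an embedded newline, in which case wrapping the string in `StringIO` is the safer route.

```python
import csv

multi_line = 'one,two,three\n1,2,3'

# str.splitlines() yields one string per row, which csv.reader accepts.
rows = list(csv.reader(multi_line.splitlines()))
for row in rows:
    print(row)
```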


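**`csv.DictReader`** is similar to `csv.reader`, except that it returns each row as a mapping (an `OrderedDict` in Python 3.6 and 3.7, a plain `dict` from 3.8 onward) whose keys are given by the optional `fieldnames` parameter.  If `fieldnames` is omitted, the values in the first row are used as the keys.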

In [27]:
text = \
    """Field1, Field2, Field3
    1, 2, 3
    4, 5, 6
    a, b, c"""
reader = csv.DictReader(StringIO(text))
# `reader` itself is just a DictReader object, not an OrderedDict
for row in reader:
    print(row)

OrderedDict([('Field1', '    1'), (' Field2', ' 2'), (' Field3', ' 3')])
OrderedDict([('Field1', '    4'), (' Field2', ' 5'), (' Field3', ' 6')])
OrderedDict([('Field1', '    a'), (' Field2', ' b'), (' Field3', ' c')])
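
The writing counterpart, `csv.DictWriter`, works the same way in reverse.  The round trip below is a sketch that writes to an in-memory `StringIO` buffer; the field names are borrowed from the example above and the variable names are illustrative.

```python
import csv
from io import StringIO

buffer = StringIO(newline='')   # let the csv module control line endings
writer = csv.DictWriter(buffer, fieldnames=['Field1', 'Field2', 'Field3'])
writer.writeheader()
writer.writerow({'Field1': 1, 'Field2': 2, 'Field3': 3})
writer.writerows([{'Field1': 'a', 'Field2': 'b', 'Field3': 'c'}])

written = buffer.getvalue()
print(repr(written))

# Read the text back; every value comes back as a string.
rows = [dict(row) for row in csv.DictReader(StringIO(written, newline=''))]
print(rows)
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''` for the same reason.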


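The [`Sniffer`](https://docs.python.org/3/library/csv.html#csv.Sniffer) class deduces the format of a CSV sample.  Its `has_header()` method guesses, heuristically, whether the first row looks like a header: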

In [28]:
# Make sure we're not at end of stream
s.seek(0)
csv.Sniffer().has_header(s.readline())

True

Note: `.read()` will also work here, but it is [safer](https://stackoverflow.com/a/35757505/7954504) to just supply the first row.
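
`Sniffer` can also guess the delimiter through its `sniff()` method, which returns a dialect that can be handed to `csv.reader`.  The sketch below uses a made-up semicolon-separated `sample` and restricts the candidate delimiters; treat the result as a heuristic guess rather than a guarantee.

```python
import csv
from io import StringIO

sample = 'col1;col2;col3\n1;4.4;99\n2;4.5;200\n'

# Limit the candidates so that '.' in the numbers is never considered.
dialect = csv.Sniffer().sniff(sample, delimiters=';,')
print('delimiter:', repr(dialect.delimiter))

rows = list(csv.reader(StringIO(sample), dialect))
for row in rows:
    print(row)
```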

# File I/O in pandas

In [None]:
import pandas as pd

The pandas I/O tools are too extensive to cover here.  For reference, see:

* [pandas IO tools](http://pandas.pydata.org/pandas-docs/stable/io.html)
* [pandas IO API reference](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)

A couple of helpful notes:

The first parameter to `pandas.read_csv` is `filepath_or_buffer`.  This can be:
- A path to a file (a **`str`**, `pathlib.Path`, or `py._path.local.LocalPath`)
- A **URL** (including http, ftp, and S3 locations)
- **Or any object with a `read()` method (such as an open file or `StringIO`)**

`pandas.read_csv` can be used with `StringIO`.

In [None]:
testdata = StringIO(
    """col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140"""
    )
df = pd.read_csv(testdata, sep=";")  # Don't need to `.read()`
print(df)

In [None]:
data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
print(pd.read_csv(StringIO(data), 
                  usecols=lambda x: x.upper() in ['COL1', 'COL3']))