# Class 11 Python Files

## 9.1 File Objects

> File objects can be used to access not only normal disk files, but also any other type of “file” that uses that abstraction.

> Remember, files are simply a contiguous sequence of bytes.
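
To make the abstraction concrete, here is a minimal Python 3 sketch. The helper `count_lines` and the sample text are made up for illustration. The same function works on a disk file and on an in-memory `io.StringIO` object, and opening the file in binary mode shows the underlying bytes.

```python
import io
import os
import tempfile


def count_lines(file_obj):
    """Count lines in any object that can be iterated like a file."""
    return sum(1 for _ in file_obj)


# A regular disk file, written to a temporary location for the demo.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Hello world\nBonjour\nHola\n')
    path = tmp.name

with open(path, 'r') as disk_file:
    disk_count = count_lines(disk_file)

# An in-memory "file": same interface, no disk involved.
memory_file = io.StringIO('Hello world\nBonjour\nHola\n')
memory_count = count_lines(memory_file)

# Binary mode exposes the raw bytes behind the text.
with open(path, 'rb') as raw:
    first_bytes = raw.read(5)

os.remove(path)
print(disk_count, memory_count, first_bytes)
```

Because `count_lines` only relies on iteration, it does not care where the data actually lives.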

## 9.2 File Built-in Functions `open()` and `file()`

1. Basic format

`file_object = open(file_name, access_mode='r', buffering=-1)`

access_mode: `r`: read, `w`: write, `a`: append, `b`: binary, `U`: universal newline support

If the file already exists, `w` truncates it and writes the new content in its place, while `a` keeps the existing content and adds the new content after it.
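
The difference between `w` and `a` is easy to check with a scratch file. The sketch below assumes Python 3; the file name `modes.txt` and its contents are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:   # creates the file
    f.write('first\n')

with open(path, 'w') as f:   # 'w' truncates, so 'first' is gone
    f.write('second\n')

with open(path, 'a') as f:   # 'a' keeps 'second' and writes after it
    f.write('third\n')

with open(path, 'r') as f:
    content = f.read()

print(repr(content))
```

Only `second` and `third` survive, because the second `w` call emptied the file before writing.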

> When you use the 'U' flag to open a file, all line separators (or terminators) will be returned by Python via any file input method, i.e., read*(), as a NEWLINE character ( \n ) regardless of what the line-endings are.

> Note that UNS [universal NEWLINE support] only applies to reading text files.
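
The quotes above describe the Python 2 `U` flag. In Python 3 the same translation is the default for text mode, so the flag is not needed (recent versions no longer accept it). The sketch below assumes Python 3 and uses a made-up file with three different line endings.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'mixed_newlines.txt')

# Write raw bytes so the three line-ending styles reach the disk unchanged.
with open(path, 'wb') as f:
    f.write(b'unix\nwindows\r\nold mac\r')

# Text mode with the default newline=None translates every ending to '\n'.
with open(path, 'r') as f:
    translated = f.readlines()

# newline='' still splits on every ending but returns it untranslated.
with open(path, 'r', newline='') as f:
    untouched = f.readlines()

print(translated)
print(untouched)
```

Writing goes through binary mode here on purpose: text-mode writing may translate `\n` itself, which would hide the effect.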

In [15]:
f_name = '/Users/acepor/Work/online_course/notebook/unicode.txt'
data = open(f_name, 'r')
print(data)

<open file '/Users/acepor/Work/online_course/notebook/unicode.txt', mode 'r' at 0x10f9e3a50>


## 9.3 File Built-in Methods

> The read() method is used to read bytes directly into a string, reading at most the number of bytes indicated.

> The readlines() method does not return a string like the other two input methods. Instead, it reads all (remaining) lines and returns them as a list of strings.

> Line termination characters are not inserted between each line, so if desired, they must be added to the end of each line before writelines() is called.

> When reading lines in from a file using file input methods like read() or readlines(), Python does not remove the line termination characters.

> ! It is possible to lose output data that is buffered if you do not explicitly close a file.

In [34]:
data = open(f_name, 'r')
print('read: ')
print(data.read())
data.close()
data = open(f_name, 'r')
print('readline: ')
print(data.readline())
data.close()
data = open(f_name, 'r')
print('readlines: ')
print(data.readlines())
data.close()
print('')
print('file loop:')
data = open(f_name, 'r')
for line in data:
    print(line)
data.close()

print('')
with open(f_name, 'a') as f:
    f.write('Hiya')

read: 
Hello world
Bonjour
Hola

readline: 
Hello world

readlines: 
['Hello world\n', 'Bonjour\n', 'Hola\n']

file loop:
Hello world

Bonjour

Hola
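
The notes on `writelines()`, line terminators and closing files can be tried out in a small sketch. It assumes Python 3; the scratch file and the `greetings` list are made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch file and sample data.
path = os.path.join(tempfile.mkdtemp(), 'greetings.txt')
greetings = ['Hello world', 'Bonjour', 'Hola']

# writelines() adds nothing between items, so we append '\n' ourselves.
# Leaving the with-block closes the file and flushes buffered output.
with open(path, 'w') as f:
    f.writelines(line + '\n' for line in greetings)

with open(path, 'r') as f:
    first_five = f.read(5)        # at most 5 characters
    rest_of_line = f.readline()   # continues from the current position
    remaining = f.readlines()     # list of the lines that are left

# Terminators are kept on input; strip them when they are not wanted.
cleaned = [line.rstrip('\n') for line in remaining]

print(repr(first_five), repr(rest_of_line), remaining, cleaned)
```

The kept terminators also explain the double spacing in the file loop output above: `print` adds its own newline after the one that came from the file.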




## 9.4 Pandas

According to the [latest Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/io.html), Pandas supports reading and writing these commonly used file formats: CSV, JSON, HTML, Local clipboard, MS Excel, HDF5 Format, Feather Format, Msgpack, Stata, SAS, Python Pickle Format, SQL, and Google Big Query. If we visualize these data formats, we get a clearer picture:

![Pandas file format map](http://acepor.github.io/images/pandas_relations.png)

A comprehensive introduction to the Pandas IO tools can be found [here](http://pandas.pydata.org/pandas-docs/stable/io.html). In this post, however, we only briefly introduce how to use Pandas to read and write some common file formats.

### CSV

CSV (comma-separated values) is one of the most common formats in data processing. It is easy for both humans and machines to read.


`data = pd.read_csv(in_file, quoting=0, sep=',', engine='c')`


`quoting` tells the parser which quoting convention the data uses. It takes one of the `csv.QUOTE_*` constants, and `0` is `csv.QUOTE_MINIMAL`.

If `sep` is set to `None` and `engine` to `'python'`, this function will sniff the delimiter automatically.

The `c` engine is much faster (at least 50%) than the `python` engine, but the `python` engine supports more features.
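
As a minimal sketch of delimiter sniffing, the example below reads made-up semicolon-separated data from an in-memory buffer instead of a real `in_file`. The sample data and variable names are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up sample data; semicolons stand in for an unknown delimiter.
raw = 'name;greeting\nAlice;Hello\nBob;Bonjour\n'

# sep=None with the python engine lets pandas detect the delimiter.
sniffed = pd.read_csv(io.StringIO(raw), sep=None, engine='python')

# Naming the delimiter explicitly allows the faster c engine.
explicit = pd.read_csv(io.StringIO(raw), sep=';', engine='c')

print(sniffed)
print(sniffed.equals(explicit))
```

Sniffing is convenient for files of unknown origin. When the delimiter is known, passing it explicitly keeps the `c` engine available.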

`data.to_csv(out_file, header=True, index=False)`

Setting `header` or `index` to `True` writes the column names or the row index to the file, and setting it to `False` leaves it out. The call above keeps the header and drops the index.
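
The effect of the two flags is easiest to see side by side. This sketch uses a made-up DataFrame and calls `to_csv` without a path, which returns the CSV text as a string instead of writing a file.

```python
import pandas as pd

# Made-up sample table used only for illustration.
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'greeting': ['Hello', 'Bonjour']})

# Without a path, to_csv returns the text instead of writing a file.
with_both = df.to_csv(header=True, index=True).splitlines()
header_only = df.to_csv(header=True, index=False).splitlines()
neither = df.to_csv(header=False, index=False).splitlines()

print(with_both)
print(header_only)
print(neither)
```

With `index=True` the header row starts with an empty field, because the default index has no name.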

### TSV

TSV (tab-separated values) is also very common, and Pandas can process it in much the same way as CSV.

`data = pd.read_table(in_file, quoting=0, sep='\t', engine='c')`
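
A short sketch with made-up data shows that `read_table` and `read_csv` with `sep='\t'` give the same result, and that writing TSV only needs `sep='\t'` in `to_csv`.

```python
import io

import pandas as pd

# Made-up tab-separated sample data.
raw = 'name\tgreeting\nAlice\tHello\nBob\tBonjour\n'

# read_table uses the tab character as its default separator.
from_table = pd.read_table(io.StringIO(raw))
from_csv = pd.read_csv(io.StringIO(raw), sep='\t')

# Writing TSV: the same to_csv call with a tab separator.
tsv_lines = from_table.to_csv(sep='\t', index=False).splitlines()

print(from_table.equals(from_csv))
print(tsv_lines)
```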

### JSON

JSON has gained popularity in recent years. It gives more control over how the data is structured, but it is less human-friendly than CSV. Pandas can lay a table out in JSON in several ways, called orients, and they are easy to confuse. Therefore, when we use Pandas to read a JSON file, we have to specify the orient: `split`, `records`, `index`, `columns` or `values`. Moreover, if the file is line-based (one JSON object per line), we can set `lines` to `True`.

`data = pd.read_json(in_file, orient='records', lines=False)`

`data.to_json(out_file, orient='records', lines=False)`
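
To see how the orients differ, the sketch below serializes one made-up DataFrame in two orients and in line-based form, then reads the line-based text back. It uses in-memory strings rather than `in_file` and `out_file`. As far as we know, `lines=True` is only accepted together with `orient='records'` when writing.

```python
import io
import json

import pandas as pd

# Made-up sample table used only for illustration.
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'greeting': ['Hello', 'Bonjour']})

# 'records': a list with one object per row.
records = json.loads(df.to_json(orient='records'))

# 'split': column names, index and data kept in separate lists.
split = json.loads(df.to_json(orient='split'))

# lines=True writes one JSON object per line instead of one big list.
lines_text = df.to_json(orient='records', lines=True)

# Reading back needs the same orient and lines settings as writing.
restored = pd.read_json(io.StringIO(lines_text), orient='records', lines=True)

print(records)
print(split)
print(lines_text)
```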

### MySQL

MySQL is one of the most popular databases, and `pandas` can easily read data from it with the help of another Python library, `sqlalchemy`.

First, we use sqlalchemy to make a MySQL connection.

    from sqlalchemy import create_engine

    def connect_db(host):
        return create_engine(host)

Then we pass a SQL query to pandas and run it over the connection we created. That is all it takes to get the query result into a `pandas` DataFrame.

    import pandas as pd

    def mysql_df(sql, con):
        df = pd.read_sql_query(sql=sql, con=con)
        return df
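
Trying this out requires a running MySQL server, so the sketch below swaps in an in-memory SQLite database from the standard library, which `pd.read_sql_query` also accepts as a connection. The table, its rows and the helper name `query_df` are made up. For MySQL itself, the string passed to `create_engine` would typically look like `mysql+pymysql://user:password@host/dbname`, assuming the `pymysql` driver is installed.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the MySQL server.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE greetings (name TEXT, greeting TEXT)')
con.executemany(
    'INSERT INTO greetings VALUES (?, ?)',
    [('Alice', 'Hello'), ('Bob', 'Bonjour')],
)
con.commit()


def query_df(sql, con):
    """Run a query and return the result as a DataFrame."""
    return pd.read_sql_query(sql=sql, con=con)


df = query_df('SELECT name, greeting FROM greetings ORDER BY name', con)
con.close()
print(df)
```

The query function is the same shape as `mysql_df`; only the connection object changes between databases.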

## Advantages

Using Pandas as a unified IO tool has two main advantages:

1. Pandas IO tools can provide a significant performance increase when reading or writing data.
2. Pandas has very detailed documentation, so the learning curve is gentler.