# File Handling

To read and write external files, there is no need to import any additional libraries.<br>
Python contains a function **open()** in the standard library that will give you an object (\_io.TextIOWrapper) to use when reading and writing data to/from a file.<br>
In its simplest form, to read a text file, it takes a single argument: a filename.

In [None]:
open??
## as an aside, you can import the signature function from the inspect module to see the function's signature
## from inspect import signature
## signature(open)

So the function returns a stream.<br>
When we pass in a single argument, a filename, it opens the file for reading (mode='r'):
<pre>
========= ===============================================================
Character Meaning
--------- ---------------------------------------------------------------
'r'       open for reading (default)
'w'       open for writing, truncating the file first
'x'       create a new file and open it for writing
'a'       open for writing, appending to the end of the file if it exists
'b'       binary mode
't'       text mode (default)
'+'       open a disk file for updating (reading and writing)
'U'       universal newline mode (deprecated)
========= ===============================================================
</pre>
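
The mode characters in the table can be tried out safely on a scratch file. The following is a minimal sketch, assuming a throwaway file called `modes_demo.txt` in a temporary directory (a made-up name, not part of the tutorial data):

```python
import os
import tempfile

# Work in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

    # 'x' creates the file and fails if it already exists.
    with open(path, 'x') as f:
        f.write('first\n')

    # 'a' appends to the end of the existing file.
    with open(path, 'a') as f:
        f.write('second\n')

    with open(path, 'r') as f:
        after_append = f.read()

    # 'w' truncates the file before writing.
    with open(path, 'w') as f:
        f.write('third\n')

    with open(path) as f:  # mode defaults to 'r' (text mode)
        after_truncate = f.read()

    # 'x' refuses to overwrite an existing file.
    try:
        open(path, 'x')
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    # Adding 'b' returns bytes instead of str.
    with open(path, 'rb') as f:
        raw_bytes = f.read()
```

Here `after_append` holds both lines, `after_truncate` holds only the last one written, and `raw_bytes` is a `bytes` object rather than a `str`.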

**Advanced remark**<br><br>
The io module provides Python’s main facilities for dealing with various types of I/O. There are three main types of I/O: text I/O, binary I/O and raw I/O.<br>
These are generic categories, and various backing stores can be used for each of them. A concrete object belonging to any of these categories is called a file object.<br>
Other common terms are stream and file-like object.<br>
See: [io documentation](https://docs.python.org/3/library/io.html)<br><br>
**Mortal usage**<br><br>
To keep things simple, to read data from a file (stream) use
* **read()** to read the whole file in one go
* **readlines()** to read all lines of the file into a list
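
To see the difference between these methods without needing a file on disk, here is a small sketch that uses `io.StringIO` as a stand-in for an opened text file. The sample content is made up for illustration:

```python
import io

# io.StringIO behaves like an opened text file, which keeps the sketch self-contained.
sample = io.StringIO('5.1,3.5\n4.9,3.0\n4.7,3.2\n')

whole_text = sample.read()        # one string holding the entire content
sample.seek(0)                    # rewind to the start of the stream
all_lines = sample.readlines()    # list of lines, each still ending in '\n'
sample.seek(0)
first_line = sample.readline()    # a single line per call
```

Note that `readlines()` loads every line into memory at once, while `readline()` returns one line per call.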

In [None]:
fh = open('data/iris.data')
## reads the complete file in one go
content = fh.read()
fh.close()
content[:100]

Often it is more useful to read a file one line at a time and process that line.<br><br>
Another tip is to use a so-called context manager:
<pre>
with open(...) as f:
    f.read...
some new commands
</pre>
This automatically closes the file once the scope of the context manager ends.<br>
So, combining this we get:

In [None]:
content = []
with open('data/iris.data') as file:
    ## read file line by line
    for line in file:
        content.append(line)
## print the first 5 lines --> note the \n still part of each line --> use strip() to remove whitespace and \n at beginning / end of each line
content[:5]
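
The automatic closing done by the context manager can be checked through the file object's `closed` attribute. This is a minimal sketch on a scratch file created with `tempfile` (its content is made up), and it also shows that the file is closed when the block is left through an exception:

```python
import os
import tempfile

# Create a small scratch file to read back (hypothetical content).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('line 1\nline 2\n')

with open(path) as f:
    closed_inside = f.closed      # False: the file is open inside the block
    first = f.readline()
closed_after = f.closed           # True: closed when the block ended

# The file is also closed when the block is left through an exception.
try:
    with open(path) as g:
        raise ValueError('something went wrong while processing')
except ValueError:
    closed_after_error = g.closed

os.remove(path)
```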

In [None]:
## there is always a shorter way in Python, not necessarily more readable --> create a list containing the first 10 lines and strip the \n :
[line.strip() for line_number, line in enumerate(open('data/iris.data')) if line_number < 10]

In [None]:
## or, how about a dictionary of the first 10 lines where the key is the line number
{line_number: line.strip() for line_number, line in enumerate(open('data/iris.data')) if line_number < 10}
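
The comprehensions above still iterate over the whole file, even though only the first 10 lines are kept. One possible alternative is `itertools.islice`, which stops reading after the requested number of lines. The helper name `head` and the generated sample rows are assumptions for illustration, and `io.StringIO` stands in for an opened file:

```python
import io
from itertools import islice

def head(lines, n=10):
    """Return the first n lines of an iterable of lines, stripped."""
    return [line.strip() for line in islice(lines, n)]

# io.StringIO stands in for an opened file; with a real file one would write
#     with open(some_path) as fh: first_lines = head(fh)
fake_file = io.StringIO(''.join(f'row {i}\n' for i in range(100)))
first_lines = head(fake_file)

# A dictionary keyed by line number, built from the same 10 lines.
numbered = dict(enumerate(first_lines))
```

Combined with a `with` block, this also makes sure the file is closed afterwards, which the one-line versions leave to the garbage collector.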

In reality, when you read in a file, you are likely to encounter some problems and want to catch possible errors.<br>
So I find it useful to wrap the processing that needs doing (usually splitting and converting to the correct type) in a separate function, say **process_line**.<br>
This also makes it possible to print the lines containing issues, but keep going for the lines that are ok!

In [None]:
## let's read the file line by line and convert columns 1 to 4 to float
## there is always a better way: context manager & iterating over the file handle stream
list_of_records = []

def process_line(ix, line):
    ## returns a tuple (float, float, float, float, str), or None when the line cannot be parsed
    if line == '':
        print(f'warning: line {ix} was empty!')
        return None
    try:
        e1, e2, e3, e4, e5 = line.split(',')
        return (float(e1), float(e2), float(e3), float(e4), e5)
    except ValueError:
        ## raised when the line has the wrong number of fields or a non-numeric value
        print(f'warning: line {ix} could not be parsed: {line.strip()}')
        return None

with open('data/iris.data') as fh:
    for ix, line in enumerate(fh):
        tpl = process_line(ix, line.strip())
        if tpl is not None:
            list_of_records.append(tpl)

In [None]:
list_of_records[:5]
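
A variation on this idea is to collect the problem lines instead of printing them, so they can be inspected afterwards. The sketch below is one possible way to do that. The names `parse_record`, `records` and `rejected` are made up for illustration, and the sample input (with one blank and one malformed line) is invented:

```python
import io

def parse_record(line):
    """Convert 'f,f,f,f,label' into a tuple; raise ValueError if malformed."""
    e1, e2, e3, e4, label = line.split(',')
    return float(e1), float(e2), float(e3), float(e4), label

# Made-up sample with one blank line and one malformed line.
sample = io.StringIO(
    '5.1,3.5,1.4,0.2,Iris-setosa\n'
    '\n'
    '4.9,oops,1.4,0.2,Iris-setosa\n'
    '6.3,3.3,6.0,2.5,Iris-virginica\n'
)

records, rejected = [], []
for ix, line in enumerate(sample):
    line = line.strip()
    try:
        records.append(parse_record(line))
    except ValueError as err:
        # Keep the line number and the reason so the input can be fixed later.
        rejected.append((ix, line, str(err)))
```

After the loop, `records` holds the two valid rows and `rejected` holds the line numbers, raw text and error messages of the other two.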

# Databases

## ODBC

Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS). The designers of ODBC aimed to make it independent of database systems and operating systems.<br>
Many databases provide ODBC clients. Using one of these ODBC clients, you can add an ODBC connection and name it; this name is known as the DSN (Data Source Name).<br>
It should in principle be possible to open an ODBC connection to SAS, Aster, Teradata, SQL Server, ... (I never managed to set up an ODBC connection to SAS).

![ODBC tool](plots/odbc.png)

In [None]:
## setting up a connection
import pyodbc
cnx = pyodbc.connect('dsn=NBSDataTD', autocommit=True)
## from a connection we can request a cursor -> a cursor provides the 'context' of a fetch operation 
crs = cnx.cursor()

In [None]:
## drop test table if it already exists
## crs.execute('drop table prl_013_cattag_v2.test')

In [None]:
## the cursor provides the environment to delegate queries to the database and provides methods to access the table resulting from the query
crs.execute('create table prl_013_cattag_v2.test ( id int, val float )')
crs.execute('insert into prl_013_cattag_v2.test values (1, 1.1)')
crs.execute('insert into prl_013_cattag_v2.test values (2, 2.2)')
crs.execute('insert into prl_013_cattag_v2.test values (3, 3.3)')

The same toy table can also be created directly in SQL:
<pre>
-- drop table prl_013_cattag_v2.test;
create table prl_013_cattag_v2.test ( id int, val float );
insert into prl_013_cattag_v2.test values (1, 1.1);
insert into prl_013_cattag_v2.test values (2, 2.2);
insert into prl_013_cattag_v2.test values (3, 3.3);
</pre>
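
The inserts above write the values directly into the SQL string. An alternative is to pass them as parameters through `?` placeholders, which pyodbc supports. Because the ODBC connection needs a configured DSN, the sketch below uses the standard library's `sqlite3` as a stand-in: it follows the same DB-API 2.0 interface and the same placeholder style. The in-memory database and the unqualified table name `test` are assumptions for illustration:

```python
import sqlite3

# In-memory SQLite database as a stand-in for the ODBC connection.
cnx_demo = sqlite3.connect(':memory:')
crs_demo = cnx_demo.cursor()
crs_demo.execute('create table test ( id int, val float )')

# '?' placeholders let the driver handle quoting and type conversion.
rows = [(1, 1.1), (2, 2.2), (3, 3.3)]
crs_demo.executemany('insert into test values (?, ?)', rows)
cnx_demo.commit()

inserted = crs_demo.execute('select id, val from test order by id').fetchall()
cnx_demo.close()
```

Passing values as parameters avoids building SQL strings by hand, which matters as soon as the values come from user input or from a file.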

In [None]:
## when the query returns data the cursor can be used to iterate over the rows
res = crs.execute('select * from prl_013_cattag_v2.test order by id')

In [None]:
[method_name for method_name in dir(res) if callable(getattr(res, method_name))]

In [None]:
for rec in res:
    print(rec)

In [None]:
## to fetch one row from the cursor call fetchone
crs.execute('select * from prl_013_cattag_v2.test order by id').fetchone()

In [None]:
## to fetch all rows from the cursor call fetchall
crs.execute('select * from prl_013_cattag_v2.test order by id').fetchall()

In [None]:
## to get the schema of the table that resulted from the query, use description
crs.description

In [None]:
crs.execute('insert into prl_013_cattag_v2.test values (4, 66.99)')
## when a query does not return a table --> description returns None
crs.description
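
The fetch methods and `description` can be tried without a database server. This is a sketch using `sqlite3` from the standard library as a stand-in for the ODBC connection (same DB-API 2.0 cursor interface). The in-memory database and the unqualified table name are assumptions, and the exact contents of `description` differ per driver:

```python
import sqlite3

cnx_demo = sqlite3.connect(':memory:')
crs_demo = cnx_demo.cursor()
crs_demo.execute('create table test ( id int, val float )')
crs_demo.executemany('insert into test values (?, ?)', [(1, 1.1), (2, 2.2), (3, 3.3)])

crs_demo.execute('select id, val from test order by id')

# description holds one 7-item tuple per column; the first item is the name.
column_names = [col[0] for col in crs_demo.description]

first_row = crs_demo.fetchone()        # next row of the result
remaining = crs_demo.fetchall()        # all rows not yet fetched
exhausted = crs_demo.fetchone()        # None: nothing left to fetch

crs_demo.execute('insert into test values (4, 66.99)')
description_after_insert = crs_demo.description  # None: no result set

cnx_demo.close()
```

Note that `fetchall()` only returns the rows that have not been consumed yet, so calling it after `fetchone()` skips the first row.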

# Reading From Other Applications

## Pandas

One of the libraries that makes working with Python so much more user-friendly is pandas.<br>
There will be a separate notebook just on Pandas. Until then, do not worry too much about the details.<br>
Just note that Pandas provides two data types: **Series** and **DataFrame**.<br>
A Series captures a sequence of data indexed by an index, usually an integer, but it could also be a date, a string, ...<br>
A DataFrame is a collection of same-length sequences (columns) sharing one index. It is the Python version of an R data.frame or an SQL table.

Since the output of a SQL query is always a table<br>
(this is what makes SQL so powerful: input tables --> output tables, more formally closed under joining / merging / updating / selecting / ...),<br>
Pandas provides a convenient way to work with databases, namely the **read_sql** function.

### SQL

In [None]:
## first import Pandas --> note everybody uses the alias pd for pandas!
import pandas as pd

In [None]:
df = pd.read_sql('select * from prl_013_cattag_v2.test order by id', cnx)
df
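
For readers without access to the ODBC data source, the same call can be tried against an in-memory SQLite database. This is a sketch under that assumption. The table name `test`, the variable names and the `params` filter are illustrative and not part of the original example:

```python
import sqlite3
import pandas as pd

cnx_demo = sqlite3.connect(':memory:')
cnx_demo.execute('create table test ( id int, val float )')
cnx_demo.executemany('insert into test values (?, ?)', [(1, 1.1), (2, 2.2), (3, 3.3)])

# params fills the '?' placeholder; column names come from the query result.
df_demo = pd.read_sql('select id, val from test where id >= ? order by id', cnx_demo, params=(2,))
cnx_demo.close()
```

Recent pandas versions officially support SQLAlchemy connectables and `sqlite3` connections. Other DB-API connections, such as a raw pyodbc connection, may trigger a warning.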

### SAS

Pandas can read from a multitude of data formats, for instance SAS.<br>
As an example, let's create a toy dataset in SAS (\*.sas7bdat file)
<pre>
libname temp '...';
data temp.sas_test;
    infile datalines delimiter=','; 
    input name $ dob :ddmmyy8. tscore nmatches;
    format dob date9.;
    avgscore = tscore / nmatches;
    datalines;
Lisa,01032011,211,3
Bart,.,431,5
Homer,21011993,510,9
Marge,17101995,747,9
Basil,17101995,.,12
;
</pre>

In [None]:
## note the encoding='utf-8' is needed, otherwise text columns are returned as bytes instead of str
df = pd.read_sas(r'\\csdatg03\tfs_code\p413248\discovery analytics\library\python\tutorial\data\sas_test.sas7bdat', encoding='utf-8')
df

### Delimited File

In [None]:
df = pd.read_csv('data/iris.data', header=None, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
df.head()
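
The `header=None` and `names=...` arguments can be explored without the data file by reading from an in-memory string. This is a small sketch in which the three sample rows are made up in the same layout as the iris file:

```python
import io
import pandas as pd

# Made-up sample in the same layout as the iris file: no header row.
raw = io.StringIO(
    '5.1,3.5,1.4,0.2,Iris-setosa\n'
    '7.0,3.2,4.7,1.4,Iris-versicolor\n'
    '6.3,3.3,6.0,2.5,Iris-virginica\n'
)
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
df_demo = pd.read_csv(raw, header=None, names=columns)

# The numeric columns are parsed as floats, the label stays a string.
mean_by_species = df_demo.groupby('species')['sepal_length'].mean()
```

Without `header=None`, pandas would treat the first data row as column names and silently drop it from the data.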

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

species_to_color_map = {'Iris-setosa'    : 'blue', 'Iris-versicolor': 'orange', 'Iris-virginica' : 'green'}
fig, ax = plt.subplots(nrows=4, ncols=4, figsize=(25,15))

for x in range(4):
    for y in range(4):
        if x==y: 
            for grp, dfs in df.groupby('species'):
                ## kdeplot draws the density curve; distplot(hist=False) is deprecated in recent seaborn versions
                sns.kdeplot(x=dfs.iloc[:, x], ax=ax[x,y], color=species_to_color_map[grp], label=grp)
            ax[x,y].legend()
        else:
            sns.scatterplot(x=df.iloc[:,x], y=df.iloc[:,y], hue=df.iloc[:,4], ax=ax[x,y])

## plt.show()

# Exchanging DataFrames Between Python & R

Probably the most efficient way to exchange dataframes between Python and R is through **feather**.<br>
Feather is built on top of the Apache Arrow project, an in-memory columnar format intended to make sharing data between different tools seamless. See [apache arrow](https://arrow.apache.org/).<br>
Feather is a fast on-disk version implemented for both [Python](https://github.com/wesm/feather) and [R](https://blog.rstudio.com/2016/03/29/feather/).

In [None]:
import feather

In [None]:
feather.write_dataframe(df,'toy_data.feather')

In R:
<pre>
install.packages("feather")
require(feather)
df <- read_feather("toy_data.feather")
df
</pre>

# Persisting Python Objects: pickle

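Apart from writing simple text data, or JSON, or XML, Python provides a way to persist whole objects using the pickle library.<br>
This covers the object's state. Functions and classes are stored by reference (module and name), so their definitions must be available when the object is loaded again. Let's create a more complex object to persist: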

In [1]:
## a dictionary of functions
def my_add(a,b): return(a+b)
def my_sub(a,b): return(a-b)
def my_mul(a,b): return(a*b)
def my_div(a,b): return(a/b)
fncs = { '+': my_add, '-': my_sub, '*': my_mul, '/': my_div }

In [2]:
## this can be used like:
fncs['+'](1.1,22.22)

23.32

In [3]:
import pickle

In [4]:
with open('fncs.pickle', 'wb') as fh:
    pickle.dump(fncs, fh)

In [5]:
with open('fncs.pickle', 'rb') as fh:
    my_func_straight_from_a_file = pickle.load(fh)

In [6]:
my_func_straight_from_a_file['+'](1.1,22.22)

23.32
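
To illustrate what pickle actually stores, here is a hedged sketch that stays in memory by using `pickle.dumps` / `pickle.loads` instead of a file. It uses functions from the standard `operator` module rather than the `my_add` family above, so that it runs in any session. The variable names are made up for illustration:

```python
import operator
import pickle

# Plain data structures are stored by value and survive a round trip.
state = {'name': 'Lisa', 'scores': [211, 3], 'nested': {'ok': True}}
restored_state = pickle.loads(pickle.dumps(state))

# Functions are stored by reference: only the module and name are written.
fncs_by_name = {'+': operator.add, '-': operator.sub}
restored_fncs = pickle.loads(pickle.dumps(fncs_by_name))
same_object = restored_fncs['+'] is operator.add

# A lambda has no importable name, so it cannot be pickled.
try:
    pickle.dumps(lambda a, b: a + b)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```

In practice this means a pickled dictionary of functions can only be loaded in a session where those functions are defined or importable under the same name.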

In [1]:
## clean up the pickle file (del is the Windows command; on Linux / macOS use rm)
!del *.pickle

# Reading & Writing To/From Non-Local Files: Sockets

As an advanced encore, there are two files in the script directory that use sockets to pass data around.<br>
Sockets are file abstractions where data is written and read (or, in socket terminology, sent and received) over a network.<br>
When combined with pickling, this opens up possibilities to communicate and share objects between different machines on a network.<br>
Please see these files to explore further if you are interested.

To run the files:
* open two command windows in the script directory
* run: _python server.py_ to start the server and
* run: _python client.py_ to run the client
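
For a first impression without opening two command windows, the idea of sending a pickled object over a socket can be sketched in a single process. This is not the content of `server.py` or `client.py`. It is an illustrative sketch in which `socket.socketpair()` stands in for a real server and client, and the helper names and the 4-byte length prefix are assumptions:

```python
import pickle
import socket
import struct

def send_object(sock, obj):
    """Pickle obj and send it with a 4-byte length prefix."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack('!I', len(payload)) + payload)

def recv_exactly(sock, n_bytes):
    """Read exactly n_bytes from the socket."""
    chunks = []
    while n_bytes > 0:
        chunk = sock.recv(n_bytes)
        if not chunk:
            raise ConnectionError('socket closed before the message was complete')
        chunks.append(chunk)
        n_bytes -= len(chunk)
    return b''.join(chunks)

def recv_object(sock):
    """Receive one length-prefixed pickled object."""
    (length,) = struct.unpack('!I', recv_exactly(sock, 4))
    return pickle.loads(recv_exactly(sock, length))

# socketpair() gives two connected sockets in one process, which stands in
# for a separate server and client.
server_side, client_side = socket.socketpair()
message = {'id': 1, 'values': [1.1, 2.2, 3.3]}
send_object(client_side, message)
received = recv_object(server_side)
server_side.close()
client_side.close()
```

The length prefix tells the receiver how many bytes belong to one message, since a single `recv` call may return only part of it. Unpickling can execute arbitrary code, so this pattern should only be used between machines you trust.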