# 1. flat files

### text files .txt - containing plain text

Basic text files containing records, that is, table data without structured relationships.
A record is a row of fields (attributes), each of which contains at most one item of information.

```
filename = 'plaintextfile.txt'
file = open(filename, mode='r') # 'r' is to read, to write pass 'w'...
text = file.read() # this contains the whole text
file.close()
```

* You can avoid having to close the connection to the file by using a context manager 

```
with open('plaintextfile.txt') as file:
    print(file.readline())
```
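
As a small self-contained sketch of the difference between `read()`, `readline()`, and line-by-line iteration, the snippet below first writes a throwaway file to a temporary directory and then reads it back with context managers. The file name `example.txt` and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway text file (name and contents are illustrative).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.txt')
with open(path, mode='w') as file:
    file.write('first line\nsecond line\nthird line\n')

# read() returns the whole file as one string.
with open(path) as file:
    whole_text = file.read()

# readline() returns one line per call, including the trailing newline.
with open(path) as file:
    first = file.readline()
    second = file.readline()

# Iterating over the file object yields the lines one by one.
with open(path) as file:
    lines = [line.rstrip('\n') for line in file]

# The context manager has already closed the file at this point.
print(file.closed)
```

Iterating over the file object is usually the better choice for large files, since it never holds more than one line in memory at a time.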

### files .csv - comma separated value

Values in flat files can be separated by characters or sequences of characters other than commas, such as a tab, and the character or characters in question is called a delimiter.
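
A minimal sketch of reading a tab-delimited file with the standard library's `csv` module. The data is held in an in-memory string instead of a real file, and the column names and values are made up for illustration.

```python
import csv
import io

# Illustrative tab-delimited data held in memory instead of a real file.
raw = 'name\tage\nAda\t36\nLinus\t28\n'

# The delimiter argument tells the reader which character separates values.
reader = csv.reader(io.StringIO(raw), delimiter='\t')
header = next(reader)   # first row holds the column names
rows = list(reader)     # remaining rows; every value is still a string

print(header)
print(rows)
```

Note that `csv.reader` does no type conversion: numbers arrive as strings and have to be converted explicitly.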


# 2. files native to specific software

### Pickled files

This is a file type native to Python. 
While it may be easy to save a numpy array or a pandas dataframe to a flat file, there are many other datatypes, such as dictionaries and lists, for which it isn't obvious how to store them. If you only want to be able to import them into Python, you can serialize them. All this means is converting the object into a sequence of bytes, or bytestream (not human-readable).

```
import pickle
with open('pickled_file.pkl', 'rb') as file:    # 'rb' = read mode, binary
    data = pickle.load(file)
print(data)
```
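
The loading snippet above assumes the pickle file already exists. A minimal round-trip sketch, writing to a temporary directory, shows both directions; the dictionary contents are made up for illustration.

```python
import os
import pickle
import tempfile

# An object with no obvious flat-file representation (illustrative data).
original = {'name': 'experiment', 'values': [1, 2, 3], 'nested': {'ok': True}}

path = os.path.join(tempfile.mkdtemp(), 'pickled_file.pkl')

# 'wb' = write, binary: serialize the object to a bytestream on disk.
with open(path, 'wb') as file:
    pickle.dump(original, file)

# 'rb' = read, binary: rebuild an equal object from the bytestream.
with open(path, 'rb') as file:
    restored = pickle.load(file)

print(restored == original)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.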

### Excel spreadsheets

better with pandas

### Stata

* Stata: “Statistics” + “data”
* academic social sciences research

### SAS

* SAS: Statistical Analysis System
* business analytics and biostatistics

### HDF5 files
* Hierarchical Data Format version 5
* Standard for storing large quantities of numerical data
* Datasets can be hundreds of gigabytes or terabytes
* HDF5 can scale to exabytes

```
import h5py
filename = 'hdf5file.hdf5'
data = h5py.File(filename, 'r') # 'r' is to read
print(data.keys())
print(data['key1'].keys())
```

### MATLAB 

* “Matrix Laboratory”
* Industry standard in engineering and science
* Data saved as `.mat` files

# 3.  relational databases 

### SQLite and PostgreSQL

```
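from sqlalchemy import create_engine, inspect
engine = create_engine('sqlite:///databasename.sqlite')
# Engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in 1.4 and 2.0
table_names = inspect(engine).get_table_names()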
```
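
If SQLAlchemy is not available, the table names of an SQLite database can also be listed with the standard library's `sqlite3` module. This is a different technique from the engine above: it queries SQLite's built-in `sqlite_master` catalog directly. The in-memory database and the table names `artists` and `albums` are made up for illustration.

```python
import sqlite3

# ':memory:' creates a temporary database that lives only in RAM.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT)')
con.execute('CREATE TABLE albums (id INTEGER PRIMARY KEY, title TEXT)')

# sqlite_master is SQLite's catalog of schema objects.
cursor = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)
table_names = [row[0] for row in cursor.fetchall()]
con.close()

print(table_names)
```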


# importing files in numpy

### `loadtxt()` is great for basic cases, same datatypes in flat file

* the default `delimiter` is any white space

```
import numpy as np
filename = 'plain.txt'
data = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=[0, 2], dtype=str)
data2 = np.loadtxt(filename, delimiter='\t', skiprows=2, usecols=[1, 2])
```
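
A runnable sketch of the same call that needs no file on disk: `np.loadtxt()` also accepts file-like objects, so the data here comes from an in-memory string. The column names and numbers are made up for illustration.

```python
import io

import numpy as np

# Illustrative comma-separated data with one header row.
raw = 'x,y,z\n1,2,3\n4,5,6\n7,8,9\n'

# Skip the header row and keep only the first and third columns.
data = np.loadtxt(io.StringIO(raw), delimiter=',', skiprows=1, usecols=[0, 2])

print(data.shape)   # rows x selected columns
print(data.dtype)   # loadtxt converts to float unless dtype is given
print(data)
```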

### `genfromtxt()` can handle mixed datatypes in flat files

```
data = np.genfromtxt('plain.csv', delimiter=',', names=True, dtype=None)
```
* `names=True` tells `genfromtxt()` that the first row is a header holding the column names; `dtype=None` lets it infer a type for each column.

`data` is an object called a **structured array**: the columns have different types, whereas an ordinary numpy array must contain elements that are all the same type.
The structured array solves this by being a 1D array in which each element is one row of the imported flat file.
You can check this in the shell by executing `np.shape(data)`, which returns a one-element tuple holding the number of rows.
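
A small sketch of what a structured array looks like in practice, using an in-memory string instead of a file. The column names and values are made up for illustration, and `encoding='utf-8'` is passed explicitly so that text columns come back as strings.

```python
import io

import numpy as np

# Illustrative mixed-type data: one text column, one int, one float.
raw = 'name,age,score\nAda,36,9.5\nLinus,28,8.0\nGrace,45,9.9\n'

data = np.genfromtxt(io.StringIO(raw), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(np.shape(data))     # 1D: one element per row of the file
print(data.dtype.names)   # field names taken from the header
print(data['age'])        # select a whole column by field name
print(data[0])            # select a whole row by index
```

Columns are accessed by field name and rows by integer index, which is the main practical difference from a regular 2D array.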

### `recfromcsv()` 

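behaves similarly to `np.genfromtxt()`, except that its defaults are `delimiter=','`, `names=True`, and `dtype=None`. It is no longer available in the main namespace as of NumPy 2.0; there, call `np.genfromtxt()` with those arguments instead.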



# importing files in pandas

```
import pandas as pd

# Assign the filename: file
file = 'plain.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file, nrows=5, header=None, names=['a', 'b', 'c'], index_col=0, sep='\t', comment='#', na_values='Nothing')
```
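
A runnable sketch of how these arguments interact, reading from an in-memory string instead of a file. The data and the column names `a`, `b`, `c` are made up for illustration.

```python
import io

import pandas as pd

# Illustrative tab-separated data: a comment line, no header row,
# and the string 'Nothing' standing in for a missing value.
raw = '# exported data\n1\tx\t2.5\n2\tNothing\t3.5\n3\tz\t4.5\n'

df = pd.read_csv(
    io.StringIO(raw),
    sep='\t',               # values are separated by tabs
    comment='#',            # ignore everything after a '#'
    header=None,            # the data has no header row
    names=['a', 'b', 'c'],  # so supply the column names
    index_col=0,            # use the first column ('a') as the index
    na_values='Nothing',    # treat this string as missing
)

print(df)
```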

### excel

```
import pandas as pd
file = 'ex.xlsx'
data = pd.ExcelFile(file)
print(data.sheet_names)
df1 = data.parse('tree', skiprows=[0], names=['a', 'b']) # sheet name, as a string
df2 = data.parse(0, usecols=[0], skiprows=[0], names=['zzz']) # sheet index, as an integer
```
* `pd.read_excel()` defaults to sheet 0
* passing `sheet_name=None` returns a dict whose keys are the sheet names and whose values are DataFrames
```
df  = pd.read_excel(file, sheet_name='flaksjdhf')
df4 = pd.read_excel(open('ex.xlsx', 'rb'), dtype={'Name': str, 'Value': float}, na_values=['string1', 'string2'])
```

### SAS

```
import pandas as pd
from sas7bdat import SAS7BDAT
with SAS7BDAT('sasfile.sas7bdat') as file:
    df_sas = file.to_data_frame()
```

### Stata

```
import pandas as pd
data = pd.read_stata('statafile.dta')
```

### SQLite

```
from sqlalchemy import create_engine, text
import pandas as pd
engine = create_engine('sqlite:///dbname.sqlite')
con = engine.connect()
# SQLAlchemy 2.0 requires raw SQL strings to be wrapped in text()
rs = con.execute(text("SELECT * FROM table"))
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
con.close()
```
* with a context manager, the connection is closed automatically
```
from sqlalchemy import create_engine, text
import pandas as pd
engine = create_engine('sqlite:///dbname.sqlite')
with engine.connect() as con:
    # SQLAlchemy 2.0 requires raw SQL strings to be wrapped in text()
    rs = con.execute(text("SELECT col1, col2 FROM table"))
    df = pd.DataFrame(rs.fetchmany(size=5))
    df.columns = rs.keys()
```

```
df = pd.read_sql_query("SELECT * FROM table", engine)
```
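
`pd.read_sql_query()` also accepts a plain `sqlite3` connection from the standard library, which makes for a self-contained sketch without SQLAlchemy. The in-memory database, the table `orders`, and its rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database with one small table.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE orders (id INTEGER, amount REAL)')
con.executemany(
    'INSERT INTO orders VALUES (?, ?)',
    [(1, 9.5), (2, 20.0), (3, 3.25)],
)
con.commit()

# Run the query and get the result directly as a DataFrame,
# with column names taken from the query.
df = pd.read_sql_query('SELECT id, amount FROM orders WHERE amount > 5', con)
con.close()

print(df)
```

Compared with the `fetchall()` approach, this does the query, the DataFrame construction, and the column naming in one call.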

# in scipy

### MATLAB files
* `scipy.io.loadmat()` loads a `.mat` file as a dict where
    * keys = MATLAB variable names
    * values = objects assigned to those variables (such as numpy arrays)
* `scipy.io.savemat()` writes `.mat` files
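
A minimal round-trip sketch, assuming `scipy` is installed: it writes a dict of arrays to a `.mat` file in a temporary directory and loads it back. The file name `example.mat` and the variable name `matrix` are made up for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

path = os.path.join(tempfile.mkdtemp(), 'example.mat')

# Each dict key becomes a MATLAB variable name in the file.
matrix = np.arange(6).reshape(2, 3)
savemat(path, {'matrix': matrix})

# loadmat returns a dict; keys starting with '__' hold file metadata.
mat = loadmat(path)
variable_names = sorted(key for key in mat if not key.startswith('__'))

print(type(mat))
print(variable_names)
print(mat['matrix'])
```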