# Data Loading, Storage, and File Formats

*Data loading* refers to reading data and making it accessible. *Parsing* is also used to describe loading text data and interpreting it as tables and different data types.

Resources: https://github.com/wesm/pydata-book 

## Index

* [Reading and Writing Data in Text Format](#reading-and-writing-data-in-text-format)
    * [pandas.read_csv Function Arguments](#some-pandasread_csv-function-arguments)
    * [Reading Text Files in Pieces](#reading-text-files-in-pieces)
    * [Writing Data to Text Format](#writing-data-to-text-format)
    * [Working with Other Delimited Formats](#working-with-other-delimited-formats)
    * [JSON Data](#json-data)
    * [XML and HTML: Web Scraping](#xml-and-html-web-scraping)
* [Binary Data Formats](#binary-data-formats)
    * [Reading Microsoft Excel Files](#reading-microsoft-excel-files)

## Reading and Writing Data in Text Format

*Text and binary data loading functions in pandas*
|Function|Description|
|---|---|
|**read_csv** |**Load delimited data from a file, URL, or file-like object; use comma as default delimiter**|
|read_fwf |Read data in fixed-width column format (i.e., no delimiters)|
|read_clipboard |Variation of read_csv that reads data from the clipboard; useful for converting tables from web pages|
|read_excel |Read tabular data from an Excel XLS or XLSX file|
|read_hdf |Read HDF5 files written by pandas|
|read_html |Read all tables found in the given HTML document|
|**read_json** |**Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object**|
|read_feather |Read the Feather binary file format|
|read_orc |Read the Apache ORC binary file format|
|read_parquet |Read the Apache Parquet binary file format|
|read_pickle |Read an object stored by pandas using the Python pickle format|
|read_sas |Read a SAS dataset stored in one of the SAS system’s custom storage formats|
|read_spss |Read a data file created by SPSS|
|read_sql |Read the results of a SQL query (using SQLAlchemy)|
|read_sql_table |Read a whole SQL table (using SQLAlchemy); equivalent to using a query that selects everything in that table using read_sql|
|read_stata |Read a dataset from Stata file format|
|read_xml |Read a table of data from an XML file|

Some of these functions have a long list of optional arguments (`pandas.read_csv()` has around 50), so if you are struggling to read a particular file, check the online documentation to find the arguments you need.

```python
import pandas as pd

# Some examples:

# The CSV file has no header row, so pandas assigns default column names
pd.read_csv("example.csv", header=None)

# Or specify the names yourself; index_col="message" makes that column the index
pd.read_csv("example.csv", names=["a", "b", "c", "d", "message"],
            index_col="message")

# Pass a list of column names to index_col for a hierarchical index

# You can pass a regular expression as the delimiter:
# use sep=r"\s+" if the fields are separated by a variable amount of whitespace

# skiprows=[2, 3, 5] skips those rows (numbered from 0)
pd.read_csv("example.csv", skiprows=[2, 3, 5])
```
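
The two options that are only described in comments above (a regular-expression delimiter and a hierarchical index) can be tried without any file on disk. This is a minimal sketch: the text in `raw_ws` and `raw_csv` is made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical whitespace-delimited text. The header has one field fewer
# than the data rows, so pandas uses the first column as the index.
raw_ws = io.StringIO(
    "A      B      C\n"
    "aaa  -0.26  -1.02  -0.61\n"
    "bbb   0.92   0.30  -0.03\n"
)
ws = pd.read_csv(raw_ws, sep=r"\s+")
print(ws)

# Hypothetical CSV text with two key columns used as a hierarchical index
raw_csv = io.StringIO(
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,5,6\n"
)
hier = pd.read_csv(raw_csv, index_col=["key1", "key2"])
print(hier)
```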

#### *Some pandas.read_csv function arguments*
|Argument|Description|
|---|---|
|path |String indicating filesystem location, URL, or file-like object.|
|sep or delimiter |Character sequence or regular expression to use to split fields in each row.|
|header |Row number to use as column names; defaults to 0 (first row), but should be None if there is no header row.|
|index_col |Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index.|
|names |List of column names for result.|
|skiprows| Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip.|
|na_values |Sequence of values to replace with NA. They are added to the default list unless keep_default_na=False is passed.|
|keep_default_na |Whether to use the default NA value list or not (True by default).|
|comment| Character(s) to split comments off the end of lines.|
|parse_dates |Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Otherwise, can specify a list of column numbers or names to parse. If element of list is tuple or list, will combine multiple columns together and parse to date (e.g., if date/time split across two columns).|
|keep_date_col |If joining columns to parse date, keep the joined columns; False by default.|
|converters |Dictionary containing column number or name mapping to functions (e.g., {"foo": f} would apply the function f to all values in the "foo" column).|
|dayfirst |When parsing potentially ambiguous dates, treat as international format (e.g., 7/6/2012 -> June 7, 2012); False by default.|
|date_parser |Function to use to parse dates.|
|nrows |Number of rows to read from beginning of file (not counting the header).|
|iterator |Return a TextFileReader object for reading the file piecemeal. This object can also be used with the with statement.|
|chunksize |For iteration, size of file chunks.|
|skipfooter |Number of lines to ignore at end of file.|
|verbose |Print various parsing information, like the time spent in each stage of the file conversion and memory use information.|
|encoding |Text encoding (e.g., "utf-8" for UTF-8 encoded text). Defaults to "utf-8" if None.|
|squeeze |If the parsed data contains only one column, return a Series.|
|thousands |Separator for thousands (e.g., "," or "."); default is None.|
|decimal| Decimal separator in numbers (e.g., "." or ","); default is ".".|
|engine |CSV parsing and conversion engine to use; can be one of "c", "python", or "pyarrow". The default is "c", though the newer "pyarrow" engine can parse some files much faster. The "python" engine is slower but supports some features that the other engines do not.|
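
A few of the arguments in the table can be combined in one call. The sketch below is only an illustration: the CSV text, the `MISSING` marker and the column names are invented, and `io.StringIO` replaces a real file.

```python
import io

import pandas as pd

# Hypothetical CSV text with a comment line and a custom missing-value marker
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "date,city,amount\n"
    "2012-06-07,madrid,10\n"
    "2012-06-08,MISSING,20\n"
)

df = pd.read_csv(
    raw,
    comment="#",                    # drop everything after '#'
    na_values=["MISSING"],          # treated as NA, on top of the defaults
    parse_dates=["date"],           # parse this column as datetime
    converters={"amount": float},   # apply float() to every value
)
print(df)
print(df.dtypes)

# dayfirst=True reads an ambiguous date such as 7/6/2012 as June 7, 2012
ambiguous = pd.read_csv(io.StringIO("when\n7/6/2012\n"),
                        parse_dates=["when"], dayfirst=True)
print(ambiguous)
```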

### Reading Text Files in Pieces

When processing a large file, you may want to read only a small piece of it. Before doing so, you can limit how many rows pandas displays: `pd.options.display.max_rows = 10`


```python
import pandas as pd
import numpy as np

# limit the number of rows you want to see
result = pd.read_csv("../datasets/ex6.csv", nrows=5)

print(result)
```

```
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
```


```python
import pandas as pd
import numpy as np

# to read a file in pieces, chunksize as a number of rows
chunker = pd.read_csv("../datasets/ex6.csv", chunksize=1000)

# Creating empty Series
tot = pd.Series([], dtype='int64')

# Exploring 'chunker' and adding key counts to the series
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)

print(tot[:10])
```

```
key
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
```
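
The same chunked counting can be reproduced on a tiny in-memory dataset, which makes it easy to see how many pieces come back. This is a sketch with invented data (ten rows, keys alternating between `a` and `b`); it also uses the reader in a `with` statement, as mentioned for the `iterator` argument in the table above.

```python
import io

import pandas as pd

# Hypothetical data: 10 rows whose 'key' alternates between 'a' and 'b'
rows = "".join(f"{'ab'[i % 2]},{i}\n" for i in range(10))
raw = io.StringIO("key,value\n" + rows)

counts = pd.Series(dtype="float64")
sizes = []

# chunksize=4 yields pieces of at most 4 rows each
with pd.read_csv(raw, chunksize=4) as reader:
    for piece in reader:
        sizes.append(len(piece))
        counts = counts.add(piece["key"].value_counts(), fill_value=0)

print(sizes)
print(counts)
```

Only one piece is held in memory at a time, so the approach scales to files that do not fit in RAM as long as the running total stays small.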


### Writing Data to Text Format

```python
import sys

import pandas as pd

# Variable with data:
data = pd.read_csv("example.csv")

# Exporting to CSV:
data.to_csv("example_out.csv")

# Writing to sys.stdout prints the CSV text instead of saving it to a file
# Changing the delimiter:
data.to_csv(sys.stdout, sep="|")

# Filling NA values with a string:
data.to_csv(sys.stdout, na_rep="NULL")

# By default, to_csv() writes column and row labels. You can disable both with:
data.to_csv(sys.stdout, index=False, header=False)

# Choosing columns and their order:
data.to_csv(sys.stdout, index=False, columns=["a", "b", "c"])
```
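
To check what `to_csv()` actually writes, the output can be captured in a string buffer and read back. A minimal sketch, assuming a small made-up DataFrame named `frame`:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data with one missing value
frame = pd.DataFrame({"a": [1, 2], "b": [np.nan, 4.0], "c": ["x", "y"]})

# Write to an in-memory buffer instead of a file or sys.stdout
buffer = io.StringIO()
frame.to_csv(buffer, sep="|", na_rep="NULL", index=False,
             columns=["c", "a", "b"])
text = buffer.getvalue()
print(text)

# Read the text back; "NULL" is already in pandas' default NA list
back = pd.read_csv(io.StringIO(text), sep="|")
print(back)
```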

### Working with Other Delimited Formats

In some cases, `pandas.read_csv()` will not work because the CSV file has malformed lines. In these cases you will need to process the file manually with Python's built-in `csv` module and a *for* loop.

*CSV dialect options*
|Argument|Description|
|---|---|
|delimiter |One-character string to separate fields; defaults to ",".|
|lineterminator |Line terminator for writing; defaults to "\r\n". Reader ignores this and recognizes cross-platform line terminators.|
|quotechar |Quote character for fields with special characters (like a delimiter); default is '"'.|
|quoting |Quoting convention. Options include csv.QUOTE_ALL (quote all fields), csv.QUOTE_MINIMAL (only fields with special characters like the delimiter), csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE (no quoting). See Python’s documentation for full details. Defaults to QUOTE_MINIMAL.|
|skipinitialspace |Ignore whitespace after each delimiter; default is False.|
|doublequote |How to handle quoting character inside a field; if True, it is doubled (see online documentation for full detail and behavior).|
|escapechar |String to escape the delimiter if quoting is set to csv.QUOTE_NONE; disabled by default.|

```python
import csv

# Opening and reading the CSV file; 'with' closes the file when the block ends
with open("../datasets/ex7.csv") as file:
    reader = csv.reader(file)

    # iterating line by line with a 'for' loop
    for line in reader:
        print(line)
```

```
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
```


```python
# Putting the data in the needed form

import csv

with open("../datasets/ex7.csv") as file:
    lines = list(csv.reader(file))

print(lines)

# Selecting header and values:
header, values = lines[0], lines[1:]

# Dictionary comprehension: 'h' is a header name and 'v' is the tuple of
# values in that column. Loading every line into a list uses a lot of
# memory for larger files.
data_dict = {h: v for h, v in zip(header, zip(*values))}

print(f"New data: \n{data_dict}")
```

```
[['a', 'b', 'c', 'd'], ['1', '2', '3', '4'], ['1', '2', '3', '4'], ['1', '2', '3', '4'], ['1', '2', '3', '4'], ['1', '2', '3', '4'], ['1', '2', '3', '4']]
New data: 
{'a': ('1', '1', '1', '1', '1', '1'), 'b': ('2', '2', '2', '2', '2', '2'), 'c': ('3', '3', '3', '3', '3', '3'), 'd': ('4', '4', '4', '4', '4', '4')}
```
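
The dialect options from the table can be grouped in a subclass of `csv.Dialect` and reused for both writing and reading. A minimal sketch: the class name `PipeDialect` and the sample rows are hypothetical, and an in-memory buffer replaces a real file.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """Hypothetical dialect: pipe-delimited, quoted only when needed."""
    delimiter = "|"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


# Write two rows; the field containing the delimiter gets quoted
buffer = io.StringIO()
writer = csv.writer(buffer, dialect=PipeDialect)
writer.writerow(("one", "two", "three"))
writer.writerow(("1", "2|2", "3"))
text = buffer.getvalue()
print(text)

# Read the text back with the same dialect
rows = list(csv.reader(io.StringIO(text), dialect=PipeDialect))
print(rows)
```

The same options can also be passed one by one as keyword arguments, for example `csv.reader(file, delimiter="|")`, when a reusable class is not needed.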


### JSON Data

```python
import json
import sys

import pandas as pd

# A JSON document held in a Python string (contents elided)
data = """{...}"""
# Convert the JSON string to a Python object
result = json.loads(data)
# Convert the Python object back to a JSON string
asjson = json.dumps(result)

# Converting JSON datasets into a Series or DataFrame
df = pd.read_json("example.json")

# Nested JSON:
with open("example.json", "r") as json_file:
    data = json.load(json_file)

content = data["companies"]
df = pd.DataFrame()

for item in content:
    temp_df = pd.json_normalize(item, record_path=["employees"], meta=["company"])
    df = pd.concat([df, temp_df], ignore_index=True)

print(df)

# Export data from pandas to JSON; two of the available layouts:
df.to_json(sys.stdout)
df.to_json(sys.stdout, orient="records")
```
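
The same steps can be followed end to end on a small in-memory document. This is a sketch only: the company and employee values are invented, and the keys simply mirror the `company` and `employees` names used above.

```python
import io
import json

import pandas as pd

# Hypothetical JSON text with one nested list
obj = """
{"company": "Acme",
 "employees": [{"name": "Ana", "age": 30},
               {"name": "Luis", "age": 41}]}
"""

result = json.loads(obj)      # JSON text -> Python dict
asjson = json.dumps(result)   # Python dict -> JSON text

# Flatten the nested list; 'company' is repeated on every row
employees = pd.json_normalize(result, record_path=["employees"],
                              meta=["company"])
print(employees)

# With no path given, to_json returns the JSON text as a string
records = employees.to_json(orient="records")
print(records)

# Read the records back into a DataFrame
back = pd.read_json(io.StringIO(records))
print(back)
```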



### XML and HTML: Web Scraping

There are many Python libraries for reading HTML and XML (lxml, Beautiful Soup, html5lib...), and pandas has `pandas.read_html` to parse tables out of HTML files as DataFrame objects. It needs some additional libraries:

```bash
conda install lxml beautifulsoup4 html5lib
# or
pip install lxml 
```

*HTML extracted from https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/*



```python
import pandas as pd

# Reading HTML
tables = pd.read_html("../datasets/fdic_failed_bank_list.html")

print(len(tables))

fails = tables[0]

print(fails.head())
# Pandas will insert a line break character '\' because it has many columns:
```

```
1
                      Bank Name             City  ST   CERT  \
0                   Allied Bank         Mulberry  AR     91   
1  The Woodbury Banking Company         Woodbury  GA  11297   
2        First CornerStone Bank  King of Prussia  PA  35312   
3            Trust Company Bank          Memphis  TN   9956   
4    North Milwaukee State Bank        Milwaukee  WI  20364   

                 Acquiring Institution        Closing Date       Updated Date  
0                         Today's Bank  September 23, 2016  November 17, 2016  
1                          United Bank     August 19, 2016  November 17, 2016  
2  First-Citizens Bank & Trust Company         May 6, 2016  September 6, 2016  
3           The Bank of Fayette County      April 29, 2016  September 6, 2016  
4  First-Citizens Bank & Trust Company      March 11, 2016      June 16, 2016  
```
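
The date columns come back as plain strings, so a possible next step is converting them with `pd.to_datetime`. The sketch below rebuilds by hand the five rows printed above (so it runs without the HTML file) and assumes the `"%B %d, %Y"` format matches every value; on the full table the counts would of course differ.

```python
import pandas as pd

# The five rows shown above, typed in manually for illustration
fails = pd.DataFrame({
    "Bank Name": [
        "Allied Bank",
        "The Woodbury Banking Company",
        "First CornerStone Bank",
        "Trust Company Bank",
        "North Milwaukee State Bank",
    ],
    "Closing Date": [
        "September 23, 2016",
        "August 19, 2016",
        "May 6, 2016",
        "April 29, 2016",
        "March 11, 2016",
    ],
})

# Parse "Month day, year" strings into timestamps
close_timestamps = pd.to_datetime(fails["Closing Date"], format="%B %d, %Y")
print(close_timestamps)

# Count bank failures per year
per_year = close_timestamps.dt.year.value_counts()
print(per_year)
```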


## Binary Data Formats

The pickle format is only recommended for short-term storage. All pandas objects have a *to_pickle* method that writes the data to disk in pickle format.
```python
import pandas as pd

frame = pd.read_csv("example.csv")

# DataFrame to pickle
frame.to_pickle("example")
# Reading
pd.read_pickle("example")
```
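
A round trip through a temporary directory shows that the restored object is identical to the original. A minimal sketch with a made-up DataFrame; the file name `frame.pkl` is arbitrary.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data to store
frame = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})

# The temporary directory is deleted when the 'with' block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "frame.pkl"
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, labels and dtypes
print(restored.equals(frame))
```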

### Reading Microsoft Excel Files

You can read Excel files with the `pandas.ExcelFile` class or the `pandas.read_excel` function. pandas requires some add-on packages to do that: `conda install openpyxl xlrd`.

```python
import pandas as pd
import numpy as np

# Create an instance with xlsx file path
xlsx = pd.ExcelFile("../datasets/ex1.xlsx")

print(f"Sheet names in xlsx file: {xlsx.sheet_names}")

# Reading data stored in a sheet with 'parse'
# selecting index_col
sheet = xlsx.parse(sheet_name="Sheet1", index_col=0)
print(f"\n{sheet}")

# The alternative option is pd.read_excel 
frame = pd.read_excel("../datasets/ex1.xlsx", sheet_name="Sheet1")
print(f"\nPrinting pd.read_excel: \n{frame}")
```

```
Sheet names in xlsx file: ['Sheet1']

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

Printing pd.read_excel: 
   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo
```
