In [15]:
import numpy as np 
import pandas as pd 

We can store data in binary formats, which are typically faster to read and write, and often more compact, than plain-text formats like CSV.

---

##### Main Ways to Store Data in Binary Formats

| Format   | File Extension | Speed     | Compression | Notes                                         |
|----------|----------------|-----------|-------------|-----------------------------------------------|
| Pickle   | `.pkl`         | Fast      | Optional    | Python-only; not secure for untrusted sources |
| Feather  | `.feather`     | Very fast | Optional    | Good for R/Python data exchange               |
| Parquet  | `.parquet`     | Fast      | ✅          | Best for large columnar data                  |
| HDF5     | `.h5`          | Medium    | ✅          | Supports partial I/O and indexing             |
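
The "columnar" note for Parquet can be made concrete: because values are stored column by column, a reader can load only the columns it needs. The sketch below is a minimal illustration, assuming either pyarrow or fastparquet is installed alongside pandas; the helper name `read_selected_columns` is hypothetical and not part of pandas.

```python
import pandas as pd


def read_selected_columns(frame, path, columns):
    """Write `frame` to a Parquet file, then load back only `columns`."""
    # to_parquet uses whichever Parquet engine (pyarrow or fastparquet) is installed
    frame.to_parquet(path)
    # columns= restricts the read to the listed columns
    return pd.read_parquet(path, columns=columns)
```

For example, calling `read_selected_columns(frame, "examples/frame.parquet", ["a", "message"])` (an assumed path) would return a two-column DataFrame.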


In [16]:
frame = pd.read_csv("examples/ex1.csv")
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [17]:
frame.to_pickle("examples/frame_pickle")

# Pickle files are generally readable only in Python. You can read any "pickled" object stored in a file with the built-in pickle module, or more conveniently with pandas.read_pickle
pd.read_pickle("examples/frame_pickle")

# Pickle is recommended only as a short-term storage format, because it is hard to guarantee that the format will stay stable over time:

# an object pickled today may not unpickle with a later version of a library.

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
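
As a minimal sketch of the two reading routes described above (the built-in pickle module and pandas.read_pickle), the round trip below uses a small stand-in frame and a temporary directory. Both are assumptions for illustration rather than the files under examples/.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in frame; the values mirror the ex1.csv output shown above
frame = pd.DataFrame(
    {
        "a": [1, 5, 9],
        "b": [2, 6, 10],
        "c": [3, 7, 11],
        "d": [4, 8, 12],
        "message": ["hello", "world", "foo"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "frame_pickle"  # hypothetical scratch path
    frame.to_pickle(path)

    # Route 1: the standard-library pickle module
    with open(path, "rb") as fh:
        via_pickle = pickle.load(fh)

    # Route 2: the pandas convenience function
    via_pandas = pd.read_pickle(path)

# Compare each loaded object with the original frame
print(via_pickle.equals(frame), via_pandas.equals(frame))
```

Either route should only be used on files you trust, since unpickling can execute arbitrary code.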


## [ Reading Microsoft Excel Files ]
- Use either the `pandas.ExcelFile` class or the `pandas.read_excel` function.
- These tools use the add-on packages xlrd and openpyxl to read old-style XLS and newer XLSX files, respectively.
- Both packages must be installed separately from pandas.

In [18]:
xlsx = pd.ExcelFile("examples/excel.xlsx")
xlsx.sheet_names # the list of sheet names in the file

['Sheet1', 'Sheet2']

In [19]:
xlsx.parse(sheet_name="Sheet1")

Unnamed: 0,Text,Date,Number,Currency,Time,Percentage,Forumula
0,Row 1,2017-01-01,100,100,11:00:00,0.1,10
1,Row 2,2017-01-02,200,200,12:00:00,0.2,40


In [20]:
xlsx.parse(sheet_name="Sheet1", index_col=0)

Text,Date,Number,Currency,Time,Percentage,Forumula
Row 1,2017-01-01,100,100,11:00:00,0.1,10
Row 2,2017-01-02,200,200,12:00:00,0.2,40


In [21]:
# we can also simply pass the filename to pandas.read_excel
frame = pd.read_excel("examples/excel.xlsx", sheet_name="Sheet2")
frame

Unnamed: 0,Text,Date,Number,Currency,Time,Percentage,Forumula
0,fsdf,2017-01-01,100,100,11:00:00,0.1,10
1,hghf,2017-01-02,200,200,12:00:00,0.2,40


In [22]:
# sheet_name=None loads all sheets
# named all_sheets to avoid shadowing the built-in all()
all_sheets = pd.read_excel('examples/excel.xlsx', sheet_name=None)
all_sheets

{'Sheet1':     Text       Date  Number  Currency      Time  Percentage  Forumula
 0  Row 1 2017-01-01     100       100  11:00:00         0.1        10
 1  Row 2 2017-01-02     200       200  12:00:00         0.2        40,
 'Sheet2':    Text       Date  Number  Currency      Time  Percentage  Forumula
 0  fsdf 2017-01-01     100       100  11:00:00         0.1        10
 1  hghf 2017-01-02     200       200  12:00:00         0.2        40}

In [23]:
df = pd.DataFrame(np.random.randint(1, 101, size=(5, 10)), columns=[f'Col{i+1}' for i in range(10)])

# one way to write pandas data to Excel format is to create an ExcelWriter, then write to it with the pandas object's to_excel method
# the "with" context manager saves and closes the file on exit
with pd.ExcelWriter("examples/exl.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sheet1")

# we can also pass a file path to to_excel and avoid the ExcelWriter
# df.to_excel("examples/exl.xlsx")
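
An ExcelWriter is mainly useful when several DataFrames should go into one workbook, which a single to_excel call with a file path cannot do. The sketch below assumes openpyxl is installed; the helper name `write_and_read_sheets` is hypothetical. It writes one sheet per DataFrame, then reads every sheet back with sheet_name=None.

```python
import pandas as pd


def write_and_read_sheets(path, frames):
    """Write each DataFrame in `frames` to its own sheet, then read all sheets back.

    `frames` maps sheet name -> DataFrame.
    """
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for sheet_name, df in frames.items():
            # index=False keeps the row index out of the sheet
            df.to_excel(writer, sheet_name=sheet_name, index=False)
    # sheet_name=None returns a dict mapping sheet name -> DataFrame
    return pd.read_excel(path, sheet_name=None)
```

Calling it with a dict such as `{"Sheet1": df, "Sheet2": df.describe()}` would produce a two-sheet workbook and return both sheets as a dict.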

## [ HDF5 Format ]

- Designed for **storing large scientific array data** efficiently  
- Supports **hierarchical structure** (multiple datasets + metadata)  
- Offers **on-the-fly compression** to reduce file size  
- Allows **partial reading/writing**, useful for data that **doesn’t fit in memory**  
- Available in **many languages**, including **Python, Java, MATLAB, Julia**


In [27]:
# While it’s possible to directly access HDF5 files using either the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame objects. The HDFStore class works like a dictionary and handles the low-level details

frame = pd.DataFrame({"a": np.random.standard_normal(100)})
store = pd.HDFStore("examples/mydata.h5")

# saves the entire DataFrame into the HDF5 file under the key "obj1"
store["obj1"] = frame
# saves only the "a" column (a Series) into the HDF5 file under the key "obj1_col"
store["obj1_col"] = frame["a"]
store


# Alternative: a context manager closes the file automatically
# with pd.HDFStore("examples/mydata.h5") as store:
#     store["obj1"] = frame
#     store["obj1_col"] = frame["a"]


<class 'pandas.io.pytables.HDFStore'>
File path: examples/mydata.h5

In [28]:
# objects contained in the HDF5 file can then be retrieved with the same dictionary-like API
store["obj1"]

Unnamed: 0,a
0,-0.347560
1,-0.843887
2,-0.063945
3,-1.219969
4,-2.408657
...,...
95,-0.377715
96,-1.273042
97,-1.254519
98,-0.639585



##### HDFStore Storage Schemes

1. **Fixed Format** (`format='fixed'`) – *Default*
- Fastest to read/write  
- Not queryable (no filtering using conditions)  
- Stores data as a serialized block  
- Best for simple storage when you don't need to query the file later

2. **Table Format** (`format='table'`)
- Supports querying, indexing, and appending  
- Slightly slower than fixed format  
- Ideal for large datasets and partial reads  
- Best for filtering rows with SQL-like conditions
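
The appending and querying behaviour of the table format can be sketched as follows. This is an illustration under assumptions: PyTables must be installed, and the helper name `append_and_query` and the key "obj" are hypothetical. `HDFStore.append` writes in table format, so the stored object can be extended chunk by chunk and then filtered on read.

```python
import pandas as pd


def append_and_query(path, chunks, lower, upper):
    """Append DataFrame chunks under one key, then select rows by index range."""
    # mode="w" starts from an empty file so repeated runs do not accumulate rows
    with pd.HDFStore(path, mode="w") as store:
        for chunk in chunks:
            # append stores data in table format, which keeps it queryable
            store.append("obj", chunk)
        # only the rows matching the condition are read from disk
        return store.select("obj", where=f"index >= {lower} and index <= {upper}")
```

For instance, passing `[frame.iloc[:50], frame.iloc[50:]]` as `chunks` with `lower=10` and `upper=15` would return the same six rows as a query on the whole frame stored at once.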


In [29]:
# saves the DataFrame under the key "obj2" and uses table format
store.put("obj2", frame, format="table")

store.select("obj2", where=["index >= 10 and index <= 15"])

Unnamed: 0,a
10,0.705679
11,-1.272541
12,-0.812817
13,-0.789321
14,-0.956059
15,-0.752552


In [30]:
store.close()
# closes the HDF5 file that was opened using pd.HDFStore()
# frees system resources like memory and file handles
# ensures that all buffered data is written to disk 

In [32]:
# The pandas.read_hdf function gives you a shortcut to these tools

# If you are processing data that is stored on remote servers, like Amazon S3 or HDFS, using a different binary format designed for distributed storage like Apache Parquet may be more suitable.
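
The pandas.read_hdf shortcut mentioned above pairs with the DataFrame.to_hdf method, so a store never has to be opened or closed explicitly. The sketch below assumes PyTables is installed; the helper name `save_and_read_head` and the default key "obj3" are hypothetical choices for illustration.

```python
import pandas as pd


def save_and_read_head(frame, path, key="obj3", n=5):
    """Write `frame` in table format, then read back rows whose index is below `n`."""
    # format="table" is needed for the where= filter on read
    frame.to_hdf(path, key=key, format="table")
    return pd.read_hdf(path, key=key, where=f"index < {n}")
```

With the 100-row `frame` from earlier, `save_and_read_head(frame, "examples/mydata.h5")` would return its first five rows.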