# Parquet File Format

To store a pandas DataFrame in a Parquet file, use the `to_parquet()` method. pandas delegates the actual writing to a Parquet engine, so either `pyarrow` or `fastparquet` must be installed. Here's an example:

```python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Store DataFrame as Parquet file
df.to_parquet('data.parquet')
```

### Importance of Parquet File Format

1. **Efficient Storage**:
   - Parquet is a columnar storage format. Values of the same type are stored together, so per-column encoding and compression (Snappy by default in pandas) usually produce much smaller files than row-based text formats like CSV.

2. **Performance**:
   - Because data is laid out by column, a query that needs only a subset of columns reads only those columns from disk. This can significantly reduce I/O operations.

3. **Schema Evolution**:
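   - Parquet supports schema evolution: new files in a dataset can add columns, and readers that support schema merging (Spark, for example) reconcile the differing file schemas. Existing files are not modified in place, so older data does not have to be rewritten when the schema grows. This is beneficial when dealing with data that changes over time.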

4. **Compatibility**:
   - Parquet is widely supported across big data tools and platforms such as Apache Hadoop and Spark, and it is a common format for datasets kept in cloud object stores such as Amazon S3. This makes it a versatile format for data sharing and processing.

5. **Self-Describing**:
   - Parquet files include metadata about the data schema and the compression methods used. This self-describing nature makes it easier to interpret the file without external documentation.

6. **Splittable**:
   - Parquet files are splittable: the data is organized into row groups that can be read independently, so multiple processes can read one file in parallel. This feature is especially useful in distributed computing environments where processing speed is crucial.

Overall, Parquet is an excellent choice for storing and managing large, complex datasets due to its efficient storage, performance benefits, and compatibility with modern data processing systems.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/anajikadam/MyRowData/main/Red-WineQT_Data.csv")
df.shape

(1143, 13)

In [3]:
df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |


In [7]:
path = "Red-WineQT_Data.csv"
df.to_csv(path)

In [4]:
path = "Red-WineQT_Data.parquet"
df.to_parquet(path)

File sizes on disk:

- `Red-WineQT_Data.csv`: 77.6 kB
- `Red-WineQT_Data.parquet`: 35.5 kB (about 46% of the CSV size)

In [10]:
%%timeit
path = "Red-WineQT_Data.parquet"
df1 = pd.read_parquet(path)
df1.shape

5.24 ms ± 79.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [6]:
df1.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |


In [11]:
%%timeit
path = "Red-WineQT_Data.csv"
df1 = pd.read_csv(path)
df1.shape

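5.21 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With only 1,143 rows, the two read times are effectively the same (5.24 ms for Parquet vs. 5.21 ms for CSV, within the measurement noise). Parquet's read advantage tends to show up on larger files, or when only a subset of columns is loaded.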


When choosing a file format for storing pandas DataFrames, you have several options, each with its own advantages and disadvantages. Here’s a comparison of **Parquet** with **Pickle**, **Feather**, and **CSV**:

### 1. **Parquet**
   - **Format**: Columnar storage format.
   - **Compression**: Supports various compression algorithms like Snappy, Gzip, Brotli, etc.
   - **Storage Efficiency**: High (compressed and columnar).
   - **Performance**: Fast reads, especially for column-specific queries; writes are typically somewhat slower than Pickle or Feather because the data is encoded and compressed.
   - **Compatibility**: Widely supported in big data ecosystems (Hadoop, Spark, etc.).
   - **Schema Evolution**: Supports schema evolution.
   - **Use Cases**: Large datasets, distributed systems, big data analytics.

### 2. **Pickle**
   - **Format**: Binary serialization format native to Python.
   - **Compression**: None by default, but can be combined with other compression methods (e.g., gzip).
   - **Storage Efficiency**: Moderate (depends on data type).
   - **Performance**: Typically very fast for both reading and writing, because it serializes Python objects directly with no format conversion; actual speed depends on the column dtypes.
   - **Compatibility**: Python-specific; not easily readable by other languages or systems.
   - **Schema Evolution**: No built-in support; deserialization issues can occur if the data structure changes.
   - **Use Cases**: Quick storage and retrieval of Python objects within a Python environment.
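
In pandas, Pickle storage goes through `to_pickle()` and `read_pickle()`, both of which accept a `compression` argument. A minimal sketch, with a made-up file name:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.pkl.gz')

    # Pickle the DataFrame and gzip the result in one step
    df.to_pickle(path, compression='gzip')

    # Only unpickle files from sources you trust
    restored = pd.read_pickle(path, compression='gzip')

print(restored.equals(df))
```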

### 3. **Feather**
   - **Format**: Columnar storage format, part of the Apache Arrow project.
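   - **Compression**: None in the original Feather format (V1); Feather V2, which is the Arrow IPC file format, supports LZ4 and ZSTD.
   - **Storage Efficiency**: Moderate to high; files are typically larger than the equivalent Parquet files because the format favors read and write speed over compact encoding.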
   - **Performance**: Very fast for both read and write operations due to its columnar nature and memory mapping.
   - **Compatibility**: Supports interoperability between Python and R (via Apache Arrow).
   - **Schema Evolution**: Limited; designed for simple data exchange.
   - **Use Cases**: Fast data exchange between Python and R, in-memory data analytics.

### 4. **CSV (Comma-Separated Values)**
   - **Format**: Plain text, row-based storage format.
   - **Compression**: None by default, but can be manually compressed (e.g., gzip).
   - **Storage Efficiency**: Low (uncompressed and row-based).
   - **Performance**: Slow for both read and write on large datasets, because every value has to be converted to and from text and column types must be inferred on each read.
   - **Compatibility**: Universally readable by most systems, applications, and programming languages.
   - **Schema Evolution**: No support; schema changes need manual adjustments.
   - **Use Cases**: Simple data storage and sharing, data export/import, human-readable format.
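
Because a CSV file stores no schema, column types have to be re-declared by whoever reads it. The sketch below illustrates this with a hypothetical `Joined` date column (not part of the original sample data): a plain `read_csv` leaves it as text, and `parse_dates` is the manual adjustment that restores the type.

```python
import io

import pandas as pd

# 'Joined' is a hypothetical column added for illustration
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Joined': pd.to_datetime(['2021-01-15', '2022-06-01', '2023-03-30']),
})

# Serialize to CSV text in memory instead of a file on disk
csv_text = df.to_csv(index=False)

# Without hints, the dates come back as plain strings
naive = pd.read_csv(io.StringIO(csv_text))

# The reader has to state which columns hold dates
typed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Joined'])

print(naive.dtypes)
print(typed.dtypes)
```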

### Summary of Use Cases:

- **Parquet**: Best for large datasets and columnar queries in distributed computing environments.
- **Pickle**: Best for fast storage and retrieval of Python-specific objects within Python applications.
- **Feather**: Best for high-performance data interchange between Python and R or when working with in-memory analytics.
- **CSV**: Best for simple data exchange and human-readable storage, but less efficient for large datasets.

Each format has its strengths depending on your use case. For example, if you're working in a Python-only environment and need quick serialization, Pickle is ideal. If you're handling big data in a distributed system, Parquet is the better choice.