# Parse Large CSV

## Summary
- Loading the `customers-2000000.csv` file in one call with `pd.read_csv` took 10.86 seconds and consumed 1506.04 MB of memory.
- Converting the `Subscription Date` column to datetime took 0.39 seconds and reduced memory usage by 112.5 MB (from 1506.04 MB to 1393.51 MB).
- Reading in chunks with `chunksize=100000` used around 75 MB per chunk. This keeps peak memory low, but the loop (including the per-chunk memory measurement) took 14.54 seconds and any processing or aggregation across chunks needs extra handling.
- Alternative libraries such as `Dask` or `Polars` may offer better memory efficiency.

In [1]:
import pandas as pd
from time import time

In [2]:
file_path = "data/customers-2000000.csv"

## Pandas direct load approach

### Load CSV file and check memory usage

In [3]:
# Load the whole CSV file in one call and time it
start = time()
df = pd.read_csv(file_path)
end = time()
print(f"Data loaded in : {end-start:.2f} seconds")

Data loaded in : 10.86 seconds


In [4]:
# Check memory usage; deep=True also counts the Python string objects
mem_usage = df.memory_usage(deep=True).sum() / (1024**2)
print(f"Memory usage : {mem_usage:.2f} MB")

Memory usage : 1506.04 MB


In [5]:
# Convert date strings to datetime; unparseable values become NaT
start = time()
df["Subscription Date"] = pd.to_datetime(df["Subscription Date"],
                                            errors="coerce")
end = time()
print(f"Subscription Date column converted in : {end-start:.2f} seconds")

Subscription Date column converted in : 0.39 seconds


In [6]:
# Check memory usage again after the conversion
mem_usage = df.memory_usage(deep=True).sum() / (1024**2)
print(f"Memory usage after converting Subscription Date column : {mem_usage:.2f} MB")

Memory usage after converting Subscription Date column : 1393.51 MB
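The drop of about 112.5 MB works out to roughly 59 bytes per row, if the file holds 2,000,000 rows as its name suggests (112.53 × 1024² / 2,000,000 ≈ 59). The sketch below illustrates where such a saving can come from: an object column stores one Python string per row, while a datetime column stores a fixed 8 bytes per row. The sample values, the ISO date format, and the helper `mem_bytes` are assumptions for illustration, not taken from the real file.

```python
import pandas as pd

# Hypothetical sample: 1,000 ISO-formatted date strings (the real column's format is assumed).
dates = pd.Series(["2021-01-01", "2021-06-15"] * 500, name="Subscription Date")

def mem_bytes(series):
    """Bytes used by the values only; deep=True also counts the string objects."""
    return series.memory_usage(index=False, deep=True)

before = mem_bytes(dates)
converted = pd.to_datetime(dates, errors="coerce")
after = mem_bytes(converted)

print(f"strings  : {before / len(dates):.1f} bytes per row")
print(f"datetime : {after / len(dates):.1f} bytes per row")
```

The exact per-row size of the string column depends on the Python and pandas versions in use, so the printed figures may differ from the 59 bytes estimated above.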


## Pandas chunk load approach

In [7]:
# Read the file in chunks of 100,000 rows; only one chunk is held at a time
total_mem_usage = 0
start = time()
with pd.read_csv(file_path, chunksize=100000) as reader:
    for chunk in reader:
        mem_usage = chunk.memory_usage(deep=True).sum() / (1024**2)
        print(f"Chunk memory usage : {mem_usage:.2f} MB")
        total_mem_usage += mem_usage
end = time()
print(f"Data loaded in : {end-start:.2f} seconds")
# This is the sum over all chunks, not the peak memory in use at any one moment
print(f"Total memory usage : {total_mem_usage:.2f} MB")

Chunk memory usage : 75.30 MB
Chunk memory usage : 75.29 MB
Chunk memory usage : 75.30 MB
Chunk memory usage : 75.31 MB
Chunk memory usage : 75.30 MB
Chunk memory usage : 75.30 MB
Chunk memory usage : 75.31 MB
Chunk memory usage : 75.31 MB
Chunk memory usage : 75.31 MB
Chunk memory usage : 75.30 MB
Chunk memory usage : 75.31 MB
Chunk memory usage : 75.31 MB
Chunk memory usage : 75.30 MB
Chunk memory usage : 75.30 MB
Chunk memory usage : 75.29 MB
Chunk memory usage : 75.29 MB
Chunk memory usage : 75.31 MB
Chunk memory usage : 75.30 MB
Chunk memory usage : 75.30 MB
Chunk memory usage : 75.30 MB
Data loaded in : 14.54 seconds
Total memory usage : 1506.04 MB
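
The loop above only measures each chunk and then discards it. A real workload has to carry some state from one chunk to the next. Below is a minimal sketch of one way to do that, assuming the goal is a value count over a single column. The `Country` column, the sample rows, and the helper `count_by_column` are hypothetical, and a small in-memory CSV stands in for the real file so the snippet runs on its own.

```python
import io
from collections import Counter

import pandas as pd

# Small in-memory CSV standing in for the real file (hypothetical data).
csv_text = """Country,Subscription Date
Chile,2021-01-01
Peru,2021-01-02
Chile,2021-01-03
Kenya,2021-01-04
Peru,2021-01-05
Chile,2021-01-06
"""

def count_by_column(source, column, chunksize):
    """Count the values of `column` chunk by chunk, keeping only running totals."""
    totals = Counter()
    # usecols limits parsing to the one column the aggregation needs.
    with pd.read_csv(source, usecols=[column], chunksize=chunksize) as reader:
        for chunk in reader:
            # Counter.update adds this chunk's counts to the running totals.
            totals.update(chunk[column].value_counts().to_dict())
    return totals

counts = count_by_column(io.StringIO(csv_text), "Country", chunksize=2)
print(counts.most_common())
```

For the real file, `source` would be `file_path` and `chunksize` could stay at 100000. Only the running totals and the current chunk are in memory at any point, which is what keeps the footprint small.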
