# Pandas: scaling to large datasets

In [None]:
import random
import string
import numpy as np
import pandas as pd
from datetime import datetime
import pathlib
%load_ext memory_profiler

Create a large dataset

In [None]:
%%time
def gen_random_string(length:int=32) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))
    
def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):

    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    n = len(index)
    np.random.seed(seed)  # call the seeding function; assigning to it would overwrite it
    columns = {
        'cat': np.random.choice(['cat1','cat2','cat3','cat4','cat5'],n),
        'str1':[gen_random_string() for _ in range(n)],
        'str2':[gen_random_string() for _ in range(n)],
        'a': np.random.rand(n),
        'b': np.random.rand(n),
        'c': np.random.randint(1,100,n),
    }

    df = pd.DataFrame(columns, index=index, columns=sorted(columns))
    if df.index[-1] == end:
        df = df.iloc[:-1]
    return df

timeseries = [
    make_timeseries(start=datetime(2020,1,1), end=datetime(2023,12,31), freq='1min', seed=10).rename(columns=lambda x: f"{x}_{i}")
    for i in range(5)
]
df = pd.concat(timeseries, axis=1)

Print the first rows to see what the data looks like.

In [None]:
df.head()

The method `info(memory_usage='deep')` prints the column types and the memory usage of the dataframe.

In [None]:
df.info(memory_usage='deep')

Write the dataframe to a Parquet file.

In [None]:
pathlib.Path("data").mkdir(parents=True,exist_ok=True)
df.to_parquet("timeseries.parquet")

## Load only useful data

Imagine you're only interested in a subset of the dataset's columns `['a_0','a_1','cat_0','str1_0','str1_1']`. Then there are two ways to proceed: 
 * either load the entire dataset and then filter out the columns you're interested in
 * or read only the columns you're interested in

Compare the two loading methods.

Look at the `read_parquet` method.

In [None]:
?pd.read_parquet

In [None]:
columns = ['a_0','a_1','cat_0','str1_0','str1_1']

**Option 1**: Load the entire dataset and then filter out the columns you're interested in

In [None]:
# TODO

**Option 2**: Read only the columns you're interested in. 

In [None]:
# TODO

You can use the magic commands `%time` and `%memit` to compare the run time and the memory usage of the two calls.

Not all pandas readers have an option to read only a subset of columns.
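
As a side illustration of the same idea with another reader, here is a minimal sketch using `read_csv`, which accepts a `usecols` argument. The toy dataframe, its column names and the in-memory CSV text are assumptions made for this example only; they are not the notebook's dataset.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for a wide file: 6 columns, 1000 rows, kept in memory as CSV text.
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.random((1000, 6)), columns=list("abcdef"))
csv_text = wide.to_csv(index=False)

# Route 1: parse every column, then keep two of them.
full_then_filter = pd.read_csv(io.StringIO(csv_text))[["a", "c"]]

# Route 2: ask the reader to parse only the two columns of interest.
subset_only = pd.read_csv(io.StringIO(csv_text), usecols=["a", "c"])

# Both routes should give the same frame; route 2 never builds b, d, e, f.
print(full_then_filter.equals(subset_only))
print(list(subset_only.columns))
```

On a file this small the difference is negligible; the gap is expected to grow with the number and size of the columns that are skipped.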

## Use efficient datatypes

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). 

Using more efficient data types reduces the memory size of a dataframe, so you can store larger datasets in memory.

In [None]:
df = pd.read_parquet("timeseries.parquet",columns=['a_0','b_0','c_0','cat_0','str1_0','str2_0'])

Look at the data types of each column

In [None]:
df.dtypes

Look at the memory usage of the dataframe. The `memory_usage()` method returns the memory usage of each column in bytes.

In [None]:
df.memory_usage(deep=True)

Compute the total size of the dataframe. You should get the same result as with the `info(memory_usage='deep')` method.

In [None]:
# TODO

The output of `memory_usage` shows that the columns using the most memory are 'str1_0', 'str2_0' and 'cat_0'. This is expected for 'str1_0' and 'str2_0', because those columns contain random strings. The 'cat_0' column, however, has only a few unique values, so it is a good candidate for conversion to a `pandas.Categorical`. With a `pandas.Categorical`, each unique value is stored once, and space-efficient integer codes record which value is used in each row.
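
To make the storage layout concrete, here is a small sketch on a toy Series (not the exercise dataframe). The Series name `labels` and its size are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy low-cardinality text column: 100,000 rows drawn from 5 labels.
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(["cat1", "cat2", "cat3", "cat4", "cat5"], size=100_000))

as_category = labels.astype("category")

# Memory before and after the conversion, in bytes.
print(labels.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))

# The unique values are stored once...
print(as_category.cat.categories.tolist())
# ...and each row holds only a small integer code pointing at one of them.
print(as_category.cat.codes.dtype)
```

With so few categories the codes fit in a one-byte integer type, which is where most of the saving comes from.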

First, we copy our dataframe to a new one.

In [None]:
df2 = df.copy()

Change the type of the 'cat_0' column to the pandas `category` dtype using the `astype()` method.

In [None]:
# TODO

Check with dtypes that the column type has changed

In [None]:
# TODO

Compute the memory usage of each column for this new dataframe.

In [None]:
# TODO

We can go a bit further and downcast the numeric columns to their smallest types using `pandas.to_numeric()`. The 'c_0' column contains integers between 1 and 99, so it can be downcast to an unsigned integer type. If single precision is sufficient for columns 'a_0' and 'b_0', they can also be downcast to `float32`. Be careful when you downcast floats: you lose precision, and the resulting error can propagate through later processing.
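
The sketch below shows the `downcast` argument of `pandas.to_numeric()` on a toy dataframe and measures the rounding error introduced on the float column. The column names `small_int` and `ratio` are hypothetical and only mimic the value ranges described above.

```python
import numpy as np
import pandas as pd

# Toy data: integers in [1, 99] and floats in [0, 1).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "small_int": rng.integers(1, 100, size=1000),
    "ratio": rng.random(1000),
})

down = toy.copy()
# Smallest unsigned integer type able to hold the values.
down["small_int"] = pd.to_numeric(toy["small_int"], downcast="unsigned")
# Smallest float type pandas considers close enough to the original values.
down["ratio"] = pd.to_numeric(toy["ratio"], downcast="float")

print(toy.dtypes)
print(down.dtypes)

# Largest absolute difference caused by the float downcast.
max_err = (toy["ratio"] - down["ratio"].astype("float64")).abs().max()
print(max_err)
```

The integer downcast is lossless here, while the float downcast is not; whether the printed error is acceptable depends on what the later processing does with the values.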

In [None]:
# TODO

Check the types and the memory usage of the columns

In [None]:
# TODO

Compute the memory reduction

In [None]:
# TODO

## Use chunking

Some problems are embarrassingly parallel and can therefore be processed with chunking, that is, by splitting a large problem into many small ones. 
For example, you can convert a big file into several smaller files and then repeat the processing for each file in a directory. 
As long as each chunk fits in memory, you can work with datasets that are much larger than memory.
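
The pattern can be sketched without touching the disk: process one piece at a time and merge the small partial results. In this sketch the slices of a toy Series stand in for separate files, and a per-chunk maximum and row count are used as the partial results; all names here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
full = pd.Series(rng.random(10_000), name="a")

# Pretend each slice is a separate file that fits in memory on its own.
chunks = [full.iloc[i:i + 2_500] for i in range(0, len(full), 2_500)]

overall_max = float("-inf")
n_rows = 0
for chunk in chunks:
    # Only one chunk and two scalars are needed at any time.
    overall_max = max(overall_max, chunk.max())
    n_rows += len(chunk)

# The merged result matches what a single pass over all the data gives.
print(overall_max == full.max())
print(n_rows == len(full))
```

This works because the maximum of the per-chunk maxima is the global maximum; statistics that cannot be merged this way (a median, for example) need a different strategy.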

In [None]:
N = 12
starts = [f"20{i:>02d}-01-01" for i in range(N)]
ends = [f"20{i:>02d}-12-31" for i in range(N)]
pathlib.Path("data/timeseries").mkdir(parents=True,exist_ok=True)
for i, (start, end) in enumerate(zip(starts, ends)):
    ts = make_timeseries(start=start, end=end, freq="1min", seed=i)
    ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")

Count the occurrences of the values in the "c" column across all the files.

In [None]:
# TODO

Some readers, like pandas.read_csv(), offer parameters to control the chunksize when reading a single file. 
In that case, it is possible to read a file chunk by chunk in order to process it.
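
As a minimal sketch of that mechanism, the example below reads an in-memory CSV in chunks of 1,000 rows and accumulates a running mean. The toy data and the variable names are assumptions for this example; it computes a mean rather than the counts asked for in the exercise.

```python
import io

import numpy as np
import pandas as pd

# Toy CSV held in memory: 10,000 rows, two columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"c": rng.integers(1, 100, size=10_000), "a": rng.random(10_000)})
csv_text = toy.to_csv(index=False)

n_rows = 0
running_sum = 0.0
# With chunksize set, read_csv yields DataFrames of at most 1,000 rows each
# instead of returning one big DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1_000):
    n_rows += len(chunk)
    running_sum += chunk["a"].sum()

print(n_rows)
print(running_sum / n_rows)
```

Only one chunk is held in memory at a time, so the same loop applies to a file on disk that is larger than the available memory.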

In [None]:
df = make_timeseries(start="2023-01-01", end="2023-12-31", freq="1min", seed=10)
df.to_csv("data/timeseries.csv")

Count the occurrences of the values in the "c" column of the CSV file by processing it chunk by chunk. You need to use the `chunksize` parameter of the `read_csv` method. 

In [None]:
# TODO