While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high-quality data cleaning, exploration, and analysis. For medium-sized data, we're better off trying to get more out of pandas rather than switching to a different tool.

In this file, we'll:

* Learn how to evaluate the memory footprint of a pandas dataframe.
* Learn about the pandas datatypes.
* Learn how to select the most efficient datatypes for a pandas dataframe.

We'll be working with data on the [Museum of Modern Art's](https://www.moma.org/) exhibitions. More specifically, we'll use the file `moma.csv`, which we can download from [data.world](https://data.world/moma/exhibitions).

![image.png](attachment:image.png)

In [10]:
import pandas as pd
import numpy as np
moma = pd.read_csv("moma.csv")
print(moma.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ExhibitionID            34129 non-null  float64
 1   ExhibitionNumber        34558 non-null  object 
 2   ExhibitionTitle         34558 non-null  object 
 3   ExhibitionCitationDate  34557 non-null  object 
 4   ExhibitionBeginDate     34558 non-null  object 
 5   ExhibitionEndDate       33354 non-null  object 
 6   ExhibitionSortOrder     34558 non-null  float64
 7   ExhibitionURL           34125 non-null  object 
 8   ExhibitionRole          34424 non-null  object 
 9   ConstituentID           34044 non-null  float64
 10  ConstituentType         34424 non-null  object 
 11  DisplayName             34424 non-null  object 
 12  AlphaSort               34424 non-null  object 
 13  FirstName               31499 non-null  object 
 14  MiddleName              3804 non-null 

The last line of the `info()` output is the one that we are interested in right now. It gives an estimate of the memory consumption of the `moma` dataframe. In this case, it estimates that the dataframe occupies `7.1+ MB` in memory.

The `+` sign indicates that this figure is a lower bound: the true memory footprint of the `moma` dataframe may be larger than 7.1 megabytes.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

We'll notice that the blocks don't maintain references to the column names. This is because blocks are optimized for storing the actual values in the dataframe.

The [BlockManager](https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/managers.py#LC63) class is responsible for maintaining the mapping between the row and column indexes and the actual blocks. It acts as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the dataframe class interfaces with the BlockManager class to translate our requests to function and method calls.

Pandas uses the ObjectBlock class to represent the block containing string columns, and the FloatBlock class to represent the block containing float columns. For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. Due to this storage scheme, accessing a slice of values is incredibly fast.

To observe how the BlockManager organizes the data, we can retrieve the internal BlockManager object from within a dataframe using the [`DataFrame._data` private attribute](https://www.geeksforgeeks.org/private-variables-python/). This will return the column and row axes, as well as the individual Block instance for each unique type in the dataframe.

In [4]:
# Access the DataFrame._data private attribute to retrieve the underlying BlockManager instance
print(moma._data)

BlockManager
Items: Index(['ExhibitionID', 'ExhibitionNumber', 'ExhibitionTitle',
       'ExhibitionCitationDate', 'ExhibitionBeginDate', 'ExhibitionEndDate',
       'ExhibitionSortOrder', 'ExhibitionURL', 'ExhibitionRole',
       'ConstituentID', 'ConstituentType', 'DisplayName', 'AlphaSort',
       'FirstName', 'MiddleName', 'LastName', 'Suffix', 'Institution',
       'Nationality', 'ConstituentBeginDate', 'ConstituentEndDate',
       'ArtistBio', 'Gender', 'VIAFID', 'WikidataID', 'ULANID',
       'ConstituentURL'],
      dtype='object')
Axis 1: RangeIndex(start=0, stop=34558, step=1)
FloatBlock: [ 0  6  9 19 20 23 25], 7 x 34558, dtype: float64
ObjectBlock: [ 1  2  3  4  5  7  8 10 11 12 13 14 15 16 17 18 21 22 24 26], 20 x 34558, dtype: object
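
The grouping of columns by type can also be inspected without touching private attributes. The sketch below uses a small hypothetical dataframe (not the MoMA data) and only the public `DataFrame.dtypes` attribute to count the columns of each type and list their positions, which roughly mirrors the block listing above.

```python
import pandas as pd

# Hypothetical toy dataframe (not the MoMA data): two float columns, three string columns.
toy = pd.DataFrame({
    "id": [1.0, 2.0, 3.0],
    "title": pd.Series(["a", "b", "c"], dtype=object),
    "sort_order": [10.0, 20.0, 30.0],
    "role": pd.Series(["x", "y", "z"], dtype=object),
    "name": pd.Series(["p", "q", "r"], dtype=object),
})

# Dtype of each column as a plain string label, e.g. "float64" or "object".
dtype_labels = toy.dtypes.astype(str)

# Number of columns per dtype.
dtype_counts = dtype_labels.value_counts()
print(dtype_counts)

# Positions of the columns that share each dtype, similar to the block listing.
positions = {
    label: [i for i, d in enumerate(dtype_labels) if d == label]
    for label in dtype_counts.index
}
print(positions)
```

This only shows which columns share a type. It doesn't reveal how pandas lays the values out internally.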


![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

If we want to learn more about the differences, we can read the blog post [Why Python Is Slow](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/), which is where the diagram came from.

While each pointer takes up 8 bytes of memory, each actual string value uses a different amount of memory.
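
To make this concrete, here is a minimal sketch that compares the size of two Python string objects using `sys.getsizeof`. The titles are made-up examples, and the exact byte counts depend on the Python implementation and version, so we only compare them with each other.

```python
import sys

# Two made-up titles of different lengths.
short_title = "Cubism"
long_title = "Cubism and Abstract Art: " * 4

# Size in bytes of each string object (object overhead plus character data).
short_size = sys.getsizeof(short_title)
long_size = sys.getsizeof(long_title)

print(short_size, long_size)
```

Both objects are larger than the 8-byte pointer that refers to them, and the longer string takes up more memory than the shorter one.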

![image.png](attachment:image.png)

![image.png](attachment:image.png)

According to what we've learned, pandas only counts the 8 bytes used by the reference of `object` datatypes. Each column of the `moma` dataframe is either of type `object` or `float64`. Since a 64-bit float also uses 8 bytes, the estimated total memory in bytes will be equal to the number of entries in the dataframe times eight.

Let's recreate the estimate of the memory footprint.

**Task**

![image.png](attachment:image.png)

**Answer**

In [5]:
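# Every entry counts as 8 bytes: a float64 value or a pointer to a string
num_entries = moma.size
total_bytes = num_entries * 8
# Convert bytes to megabytes (1 MB = 1024 * 1024 = 1048576 bytes)
total_megabytes = total_bytes / 1048576

print(total_megabytes)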

7.1187286376953125


If we recall, the original memory footprint pandas returned was 7.1+ MB, which matches our result of about 7.12 megabytes from the last step. To force pandas to inspect the memory for each linked string value and return the true memory footprint, we need to set the `memory_usage` parameter to `"deep"` when calling `DataFrame.info()`.
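
The same shallow versus deep distinction is available per column through `DataFrame.memory_usage()`. The following sketch uses a small hypothetical dataframe (not the MoMA data) and assumes a 64-bit platform, where each pointer takes up 8 bytes.

```python
import pandas as pd

# Hypothetical toy dataframe: one float column and one string (object) column.
toy = pd.DataFrame({
    "sort_order": [1.0, 2.0, 3.0, 4.0],
    "title": pd.Series(
        ["Painting in Paris", "Weber, Klee, Lehmbruck, Maillol", "Homer", "Corot"],
        dtype=object,
    ),
})

# Shallow: 8 bytes per value, counting only the pointers for object columns.
shallow = toy.memory_usage(index=False)
# Deep: also inspects the string objects that the pointers refer to.
deep = toy.memory_usage(index=False, deep=True)

print(shallow)
print(deep)
```

The float column reports the same number either way, while the object column grows once the string objects themselves are counted.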

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [6]:
obj_cols = moma.select_dtypes(include=['object'])
obj_cols_mem = obj_cols.memory_usage(deep=True)
obj_cols_sum = obj_cols_mem.sum() / 1048576
print(obj_cols_sum)

43.76634883880615


![image.png](attachment:image.png)

![image.png](attachment:image.png)

Above we've learned that integer datatypes do not have a value to represent missing values.
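
As a quick illustration with made-up values, the sketch below shows that a column containing a missing value is stored as `float64`, and that casting it to a NumPy integer type fails.

```python
import numpy as np
import pandas as pd

complete = pd.Series([1, 2, 3])         # no missing values
with_gap = pd.Series([1, 2, np.nan])    # one missing value

print(complete.dtype)   # an integer dtype
print(with_gap.dtype)   # a float dtype, because NaN is a float value

# Casting a column with missing values to an integer type raises an error.
try:
    with_gap.astype("int64")
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print("Cast failed:", err)
```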

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Let's use this to identify which numerical columns have no missing values and change their type to the smallest integer data type that can represent all of the column values.

In [7]:
def change_to_int(df, col_name):
    """Change the type of the provided column to the smallest integer datatype
    that can represent all values of that column."""

    # Get the minimum and maximum values
    col_max = df[col_name].max()
    col_min = df[col_name].min()
    # Find the datatype
    for dtype_name in ['int8', 'int16', 'int32', 'int64']:
        # Check if this datatype can hold all values (the bounds are inclusive)
        if col_max <= np.iinfo(dtype_name).max and col_min >= np.iinfo(dtype_name).min:
            df[col_name] = df[col_name].astype(dtype_name)
            break

**Task**

![image.png](attachment:image.png)

**Answer**

In [15]:
float_moma = moma.select_dtypes(include=['float64'])
print(float_moma.isnull().sum())

# ExhibitionSortOrder is the only float column with no missing values

print("------------------------------")
change_to_int(moma, 'ExhibitionSortOrder')
print(moma['ExhibitionSortOrder'].dtype)

ExhibitionID              429
ConstituentID             514
ConstituentBeginDate     9268
ConstituentEndDate      14739
VIAFID                   7562
ULANID                  12870
dtype: int64
------------------------------
int16


Above, we've found that the optimal subtype for the `ExhibitionSortOrder` column is `int16`, which represents each value in the column using 2 bytes (16 bits).
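
For reference, this short sketch prints the size and value range of each integer subtype that `change_to_int()` considers, using `np.iinfo` and `np.dtype`.

```python
import numpy as np

# Size in bytes and representable range of each signed integer subtype.
for name in ["int8", "int16", "int32", "int64"]:
    info = np.iinfo(name)
    size = np.dtype(name).itemsize
    print(f"{name}: {size} byte(s), range {info.min} to {info.max}")

int16_bytes = np.dtype("int16").itemsize
int16_max = np.iinfo("int16").max
```

With 34,558 rows, storing the column as `int16` takes 34,558 × 2 = 69,116 bytes, compared to 34,558 × 8 = 276,464 bytes as `float64`.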

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Let's use this to convert the floating-point columns.

In [17]:
float_cols = moma.select_dtypes(include=['float']).columns
float_cols

Index(['ExhibitionID', 'ConstituentID', 'ConstituentBeginDate',
       'ConstituentEndDate', 'VIAFID', 'ULANID'],
      dtype='object')

**Task**

![image.png](attachment:image.png)

**Answer**

In [18]:
for col in float_cols:
    moma[col] = pd.to_numeric(moma[col], downcast='float')
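
To see what the downcast does in isolation, here is a minimal sketch on a few made-up values. The values are chosen so that `float32` can represent them exactly.

```python
import numpy as np
import pandas as pd

# Made-up float64 values, including a missing value.
values = pd.Series([1.5, 2.25, np.nan, 4.0])
downcast = pd.to_numeric(values, downcast="float")

# Compare the dtype and the bytes used before and after the downcast.
print(values.dtype, values.memory_usage(index=False))
print(downcast.dtype, downcast.memory_usage(index=False))
```

Since `float32` keeps only about seven significant decimal digits, it's worth checking that large values (long identifiers, for example) still look right after a downcast.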

Let's move on to another numeric dtype, `datetime`. If we take a look at the object columns in the dataframe, we'll notice that both the `ExhibitionBeginDate` and the `ExhibitionEndDate` columns represent dates, so we can convert them to `datetime`.

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [19]:
moma["ExhibitionBeginDate"] = pd.to_datetime(moma["ExhibitionBeginDate"])
moma["ExhibitionEndDate"] = pd.to_datetime(moma["ExhibitionEndDate"])

print(moma[["ExhibitionBeginDate", "ExhibitionEndDate"]].memory_usage(deep=True))

Index                     128
ExhibitionBeginDate    276464
ExhibitionEndDate      276464
dtype: int64
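
Each `datetime` value takes up 8 bytes, which is why both columns report 34,558 × 8 = 276,464 bytes. The sketch below checks this on a few made-up date strings and compares the result with the memory used by the original strings.

```python
import pandas as pd

# Made-up date strings stored as Python objects, with one missing value.
dates = pd.Series(["1929-11-07", "1929-12-13", None], dtype=object)
parsed = pd.to_datetime(dates)

# Memory used by the strings versus the parsed datetime values.
string_bytes = dates.memory_usage(index=False, deep=True)
datetime_bytes = parsed.memory_usage(index=False)

print(parsed.dtype)
print(string_bytes, datetime_bytes)
```

The missing value is stored as `NaT` and takes up 8 bytes like any other date.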


![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

The `category` subtype handles missing values by assigning them the category code `-1`. Thanks to the flexibility of the `category` dtype, we can often drastically reduce a dataframe's memory footprint by converting columns to this subtype.
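
Here is a minimal sketch, using made-up values, of how a `category` column stores its data: a list of unique categories, plus one integer code per row that points into that list.

```python
import pandas as pd

# Made-up values with one missing entry.
gender = pd.Series(["Male", None, "Female", "Male"], dtype="category")

# The unique values, stored once.
categories = list(gender.cat.categories)
# One integer code per row; the missing value gets the code -1.
codes = gender.cat.codes.tolist()

print(categories)
print(codes)
```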

![image.png](attachment:image.png)

![image.png](attachment:image.png)

We should stick to using the `category` type primarily for `object` columns where less than 50% of the values are unique. If all of the values in a column are unique, the `category` type will end up using more memory. That's because the column is storing all of the raw string values in addition to the integer category codes. We can read more about the limitations of the `category` type in the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).
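
The trade-off can be measured directly. This sketch builds two hypothetical columns, one with only two distinct values and one where every value is unique, and prints the memory each uses as `object` and as `category`. The exact numbers depend on the pandas version.

```python
import pandas as pd

n = 1000
# Hypothetical columns: two distinct values repeated, versus all-unique values.
repeated = pd.Series(["Artist", "Curator"] * (n // 2), dtype=object)
unique = pd.Series([f"name-{i}" for i in range(n)], dtype=object)


def deep_bytes(series):
    """Return the memory used by the values, including the string contents."""
    return series.memory_usage(index=False, deep=True)


for label, col in [("repeated", repeated), ("unique", unique)]:
    as_object = deep_bytes(col)
    as_category = deep_bytes(col.astype("category"))
    print(f"{label}: object={as_object} bytes, category={as_category} bytes")

repeated_saves_memory = deep_bytes(repeated.astype("category")) < deep_bytes(repeated)
```

For the repeated column, the `category` version is much smaller. For the all-unique column, there is little or nothing to gain from the conversion.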

**Task**

![image.png](attachment:image.png)

**Answer**

In [20]:
for col in moma.select_dtypes(include=['object']):
    num_unique_values = len(moma[col].unique())
    num_total_values = len(moma[col])
    if num_unique_values / num_total_values < 0.5:
        moma[col] = moma[col].astype('category')
        
print(moma.info(memory_usage='deep'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   ExhibitionID            34129 non-null  float32       
 1   ExhibitionNumber        34558 non-null  category      
 2   ExhibitionTitle         34558 non-null  category      
 3   ExhibitionCitationDate  34557 non-null  category      
 4   ExhibitionBeginDate     34558 non-null  datetime64[ns]
 5   ExhibitionEndDate       33354 non-null  datetime64[ns]
 6   ExhibitionSortOrder     34558 non-null  int16         
 7   ExhibitionURL           34125 non-null  category      
 8   ExhibitionRole          34424 non-null  category      
 9   ConstituentID           34044 non-null  float32       
 10  ConstituentType         34424 non-null  category      
 11  DisplayName             34424 non-null  category      
 12  AlphaSort               34424 non-null  catego

So far, we've explored ways to reduce the memory footprint of an **existing** dataframe. By reading the dataframe in first and then iterating on ways to save memory, we were able to better understand the amount of memory we can expect to save from each optimization. As we mentioned earlier, however, we often won't have enough memory to represent all the values in a data set. How can we apply memory-saving techniques when we can't even create the dataframe in the first place?

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [22]:
keep_cols = ['ExhibitionID', 'ExhibitionNumber', 'ExhibitionBeginDate', 'ExhibitionEndDate', 
             'ExhibitionSortOrder', 'ExhibitionRole', 'ConstituentType', 'DisplayName', 
             'Institution', 'Nationality', 'Gender']

moma = pd.read_csv("moma.csv", parse_dates=["ExhibitionBeginDate", "ExhibitionEndDate"],
                   usecols=keep_cols)

print(moma.memory_usage(deep=True).sum()/(1024*1024))

14.554579734802246
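
Besides `usecols` and `parse_dates`, `pandas.read_csv()` also accepts a `dtype` parameter, which lets us pick compact types for individual columns while the file is being read. The sketch below illustrates this on a tiny in-memory CSV. The rows, the `Notes` column, and the chosen types are assumptions for illustration, not values taken from `moma.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """ExhibitionID,ExhibitionBeginDate,ExhibitionSortOrder,Gender,Notes
1,1929-11-07,1,Male,unused
1,1929-11-07,1,Female,unused
2,1929-12-13,2,Male,unused
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    # Only read the columns we need.
    usecols=["ExhibitionID", "ExhibitionBeginDate", "ExhibitionSortOrder", "Gender"],
    # Parse the date column while reading.
    parse_dates=["ExhibitionBeginDate"],
    # Choose compact types up front instead of converting afterwards.
    dtype={"ExhibitionSortOrder": "int16", "Gender": "category"},
)

print(sample.dtypes)
```

An integer type only works here because the sample column has no missing values. A column with gaps would need to stay a float type.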


In this file, we learned how pandas represents values in a data set under the hood, and how to reduce a dataframe's memory footprint. In the next file, we'll explore how to process dataframes in chunks.