# Optimizing Dataframe Memory Footprint

## Introduction

In previous courses in the [Data Scientist track](https://www.dataquest.io/path/data-scientist), we used pandas to explore and analyze data sets without much consideration for performance. While performance is rarely a problem with small data sets (under 100 megabytes), it can become an issue with larger data sets (100 megabytes to multiple gigabytes). Performance issues can make run times much longer, or cause code to fail entirely when the machine runs out of memory.

While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high-quality data cleaning, exploration, and analysis. For medium-sized data, we're better off trying to get more out of pandas than switching to a different tool.

In this course, we'll explore different techniques for working in pandas with medium-sized data sets that don't fit in memory. In this mission, we'll learn how pandas represents the values of a data set in memory, and how to reduce a dataframe's memory footprint by selecting appropriate data types for its columns. In later missions, we'll learn how to process chunks of data in pandas, and how to augment pandas with SQLite.

We'll be working with data on the [Museum of Modern Art's exhibitions](https://www.moma.org/). More specifically, we'll use the file `MoMAExhibitions1929to1989.csv`, which you can download from [data.world](https://data.world/moma/exhibitions). Here's a preview of the data set:

In [1]:
import pandas as pd
moma = pd.read_csv('../data/moma.csv')
moma.head()

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ConstituentID,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Director,9168.0,...,,American,1902.0,1981.0,"American, 1902–1981",Male,109252853.0,Q711362,500241556.0,moma.org/artists/9168
1,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Artist,1053.0,...,,French,1839.0,1906.0,"French, 1839–1906",Male,39374836.0,Q35548,500004793.0,moma.org/artists/1053
2,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Artist,2098.0,...,,French,1848.0,1903.0,"French, 1848–1903",Male,27064953.0,Q37693,500011421.0,moma.org/artists/2098
3,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Artist,2206.0,...,,Dutch,1853.0,1890.0,"Dutch, 1853–1890",Male,9854560.0,Q5582,500115588.0,moma.org/artists/2206
4,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Artist,5358.0,...,,French,1859.0,1891.0,"French, 1859–1891",Male,24608076.0,Q34013,500008873.0,moma.org/artists/5358


We've renamed this data set to `moma.csv` and read it into the `moma` dataframe above. Let's look up how much memory the dataframe consumes by default. The `DataFrame.info()` method prints a summary of the dataframe that includes an estimate of the memory it consumes.

Note that this is just an estimate of the memory footprint. We'll take a look at how the method calculates it in the next step.

In [2]:
# display the memory usage of the `moma` dataframe
moma.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
ExhibitionID              34129 non-null float64
ExhibitionNumber          34558 non-null object
ExhibitionTitle           34558 non-null object
ExhibitionCitationDate    34557 non-null object
ExhibitionBeginDate       34558 non-null object
ExhibitionEndDate         33354 non-null object
ExhibitionSortOrder       34558 non-null float64
ExhibitionURL             34125 non-null object
ExhibitionRole            34424 non-null object
ConstituentID             34044 non-null float64
ConstituentType           34424 non-null object
DisplayName               34424 non-null object
AlphaSort                 34424 non-null object
FirstName                 31499 non-null object
MiddleName                3804 non-null object
LastName                  31998 non-null object
Suffix                    157 non-null object
Institution               2458 non-null object
Nationality               26
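
As a small aside, `info()` prints its summary rather than returning it. The sketch below uses a tiny stand-in dataframe (the `sample` name and its two rows are assumptions for illustration, not the real `moma` data) and the `buf` parameter to capture the printed text and pick out the memory line:

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for `moma`; the values are illustrative only.
sample = pd.DataFrame({
    "ExhibitionID": [2557.0, 2557.0],
    "ExhibitionRole": ["Director", "Artist"],
})

# info() writes its summary to the given buffer instead of the screen.
buffer = io.StringIO()
result = sample.info(buf=buffer)
summary = buffer.getvalue()

# info() itself returns None; the estimate lives in the printed text.
print(result)

# Pick out the line that reports the estimated footprint.
memory_line = [line for line in summary.splitlines()
               if line.startswith("memory usage")][0]
print(memory_line)
```

The same approach should work on the full `moma` dataframe if you want to log the estimate instead of reading it off the screen.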

## How Pandas Represents Values in a Dataframe

The `moma` dataframe has an estimated memory footprint of 7.1+ megabytes. To grasp how pandas calculates this estimate, we first need to understand how pandas represents different types of values. Based on the dataframe summary from the last step, we can tell that the `moma` dataframe only contains `float64` and `object` columns. Let's examine how pandas represents these values.
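
To make the two types concrete, here is a minimal sketch on a stand-in dataframe. The `sample` name and its three rows are assumptions for illustration; only the column names are borrowed from `moma`. It counts the columns per type and shows that a `float64` column has a predictable size:

```python
import pandas as pd

# Hypothetical stand-in for a few `moma` columns; the values are illustrative only.
sample = pd.DataFrame({
    "ExhibitionID": pd.Series([2557.0, 2557.0, None], dtype="float64"),
    "ExhibitionSortOrder": pd.Series([1.0, 1.0, 2.0], dtype="float64"),
    "ExhibitionRole": pd.Series(["Director", "Artist", None], dtype="object"),
})

# Count how many columns use each type.
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)

# Every float64 value occupies 8 bytes, missing or not,
# so the column's size is simply 8 bytes times the number of rows.
float_bytes = sample["ExhibitionID"].memory_usage(index=False)
print(float_bytes)
```

Object columns are less predictable, because each value is a separate Python object whose size depends on its contents.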

#### The Internal Representation of a Dataframe

Under the hood, pandas groups the columns into blocks of values of the same type. Here's a preview of how pandas stores the first seven columns of the `moma` dataframe:

![how-pandas-represents-values-in-dataframe](https://s3.amazonaws.com/dq-content/pandas_dataframe_blocks.png)

You'll notice that the blocks don't maintain references to the column names. This is because blocks are optimized for storing the actual values in the dataframe. The [BlockManager class](https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L2691) is responsible for maintaining the mapping between the row and column indexes and the actual blocks. It acts as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the `DataFrame` class interfaces with the BlockManager class to translate our requests into function and method calls.

Each type has a specialized class in the `pandas.core.internals` module. Pandas uses the ObjectBlock class to represent the block containing string columns, and the FloatBlock class to represent the block containing float columns. For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, and the values are stored in a contiguous block of memory. Because the values sit side by side in memory, accessing a slice of them is fast.
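
The contiguous storage described above can be checked on a plain NumPy array. This is a minimal sketch with made-up values (the `block_values` name is an assumption for illustration); it does not touch pandas internals:

```python
import numpy as np

# Illustrative values only: a small 2-D float64 array standing in for a float block.
block_values = np.arange(12, dtype="float64").reshape(3, 4)

print(block_values.flags["C_CONTIGUOUS"])  # is the buffer one contiguous C-ordered chunk?
print(block_values.itemsize)               # bytes per float64 value
print(block_values.nbytes)                 # total bytes: 12 values * 8 bytes each

# Basic slicing returns a view onto the same buffer rather than a copy,
# which is a large part of why slicing is cheap.
row_slice = block_values[0, 1:3]
print(np.shares_memory(block_values, row_slice))
```

Because the slice shares memory with the original array, no values are copied when it is created.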

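To observe how the BlockManager organizes the data, we can retrieve the internal BlockManager object from within a dataframe using the `DataFrame._data` private attribute. This returns the column and row axes, as well as the individual Block instance for each unique type in the dataframe. Because `_data` is private, it can change between versions; newer pandas releases expose the same object as `DataFrame._mgr`, and the printed output may look different from what's shown here.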

In [3]:
# Retrieve the underlying BlockManager instance 
# and display it using the print() function.

print(moma._data)

BlockManager
Items: Index(['ExhibitionID', 'ExhibitionNumber', 'ExhibitionTitle',
       'ExhibitionCitationDate', 'ExhibitionBeginDate', 'ExhibitionEndDate',
       'ExhibitionSortOrder', 'ExhibitionURL', 'ExhibitionRole',
       'ConstituentID', 'ConstituentType', 'DisplayName', 'AlphaSort',
       'FirstName', 'MiddleName', 'LastName', 'Suffix', 'Institution',
       'Nationality', 'ConstituentBeginDate', 'ConstituentEndDate',
       'ArtistBio', 'Gender', 'VIAFID', 'WikidataID', 'ULANID',
       'ConstituentURL'],
      dtype='object')
Axis 1: RangeIndex(start=0, stop=34558, step=1)
FloatBlock: [0, 6, 9, 19, 20, 23, 25], 7 x 34558, dtype: float64
ObjectBlock: [1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 21, 22, 24, 26], 20 x 34558, dtype: object
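
The column positions listed for each block above can be approximated with the public API alone, without private attributes. The sketch below is a rough imitation of the grouping, not how the BlockManager is implemented; the `sample` dataframe and the `positions_by_dtype` name are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the first few `moma` columns; the values are illustrative only.
sample = pd.DataFrame({
    "ExhibitionID": pd.Series([2557.0, 2557.0], dtype="float64"),
    "ExhibitionNumber": pd.Series(["1", "1"], dtype="object"),
    "ExhibitionTitle": pd.Series(["Cézanne, Gauguin, Seurat, Van Gogh"] * 2, dtype="object"),
    "ExhibitionSortOrder": pd.Series([1.0, 1.0], dtype="float64"),
})

# Group column positions by type, imitating the per-type blocks.
positions_by_dtype = {}
for position, dtype in enumerate(sample.dtypes):
    positions_by_dtype.setdefault(str(dtype), []).append(position)

# Print one line per type: the column positions, then columns x rows.
for dtype_name, positions in positions_by_dtype.items():
    print(f"{dtype_name}: {positions}, {len(positions)} x {len(sample)}")
```

Running the same loop over `moma.dtypes` should reproduce the position lists shown in the BlockManager output.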
