# Optimizing Dataframe Memory Footprint

## Introduction

In previous courses in the [Data Scientist track](https://www.dataquest.io/path/data-scientist), we used pandas to explore and analyze data sets without much consideration for performance. While performance is rarely a problem with small data sets (under 100 megabytes), it can start to become an issue with larger data sets (100 megabytes to multiple gigabytes). Performance issues can make run times much longer, and cause code to fail entirely due to insufficient memory.<br>

While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high quality data cleaning, exploration, and analysis. For medium-sized data, we're better off trying to get more out of pandas, rather than switching to a different tool.<br>

In this course, we'll explore different techniques for working with medium-sized data sets in pandas that don't fit in memory. In this mission, we'll learn how pandas represents the values in a data set in memory, and how to reduce a dataframe's memory footprint by selecting the appropriate data types for columns. In later missions, we'll learn how to process chunks of data in pandas, and augment pandas with SQLite.<br>

We'll be working with data on the [Museum of Modern Art's exhibitions](https://www.moma.org/). More specifically, we'll use the file `MoMAExhibitions1929to1989.csv`, which you can download from [data.world](https://data.world/moma/exhibitions). Here's a preview of the data set:

In [1]:
import pandas as pd
moma = pd.read_csv('../data/moma.csv')
moma.head()

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ConstituentID,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Director,9168.0,...,,American,1902.0,1981.0,"American, 1902–1981",Male,109252853.0,Q711362,500241556.0,moma.org/artists/9168
1,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Artist,1053.0,...,,French,1839.0,1906.0,"French, 1839–1906",Male,39374836.0,Q35548,500004793.0,moma.org/artists/1053
2,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Artist,2098.0,...,,French,1848.0,1903.0,"French, 1848–1903",Male,27064953.0,Q37693,500011421.0,moma.org/artists/2098
3,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Artist,2206.0,...,,Dutch,1853.0,1890.0,"Dutch, 1853–1890",Male,9854560.0,Q5582,500115588.0,moma.org/artists/2206
4,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1.0,http://www.moma.org/calendar/exhibitions/1767,Artist,5358.0,...,,French,1859.0,1891.0,"French, 1859–1891",Male,24608076.0,Q34013,500008873.0,moma.org/artists/5358


We've renamed this data set to `moma.csv`. Let's start by reading in `moma.csv` as a dataframe and looking up how much memory it consumes by default. The `DataFrame.info()` method reports an estimate of the amount of memory a dataframe consumes.

Note that this is only an estimate of the memory footprint. We'll take a look at how the method calculates it in the next step.

In [2]:
# display the memory usage of the `moma` dataframe
moma.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
ExhibitionID              34129 non-null float64
ExhibitionNumber          34558 non-null object
ExhibitionTitle           34558 non-null object
ExhibitionCitationDate    34557 non-null object
ExhibitionBeginDate       34558 non-null object
ExhibitionEndDate         33354 non-null object
ExhibitionSortOrder       34558 non-null float64
ExhibitionURL             34125 non-null object
ExhibitionRole            34424 non-null object
ConstituentID             34044 non-null float64
ConstituentType           34424 non-null object
DisplayName               34424 non-null object
AlphaSort                 34424 non-null object
FirstName                 31499 non-null object
MiddleName                3804 non-null object
LastName                  31998 non-null object
Suffix                    157 non-null object
Institution               2458 non-null object
Nationality               26

## How Pandas Represents Values in a Dataframe

The `moma` dataframe has an estimated memory footprint of 7.1+ megabytes. To grasp how pandas calculates this estimate, we first need to understand how pandas represents different types of values. Based on the dataframe summary from the last step, we can tell that the `moma` dataframe only contains `float64` and `object` columns. Let's examine how pandas represents these values.

#### The Internal Representation of a Dataframe

Under the hood, pandas groups the columns into blocks of values of the same type. Here's a preview of how pandas stores the first seven columns of the `moma` dataframe:

![how-pandas-represents-values-in-dataframe](https://s3.amazonaws.com/dq-content/pandas_dataframe_blocks.png)

You'll notice that the blocks don't maintain references to the column names. This is because blocks are optimized for storing the actual values in the dataframe. The [BlockManager class](https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L2691) is responsible for maintaining the mapping between the row and column indexes and the actual blocks. It acts as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the dataframe class interfaces with the BlockManager class to translate our requests to function and method calls.<br>

Each type has a specialized class in the `pandas.core.internals` module. Pandas uses the ObjectBlock class to represent the block containing string columns, and the FloatBlock class to represent the block containing float columns. For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, and the values are stored in a contiguous block of memory. Due to this storage scheme, accessing a slice of values is incredibly fast.<br>

To observe how the BlockManager organizes the data, we can retrieve the internal BlockManager object from within a dataframe using the `DataFrame._data` private attribute. This will return the column and row axes, as well as the individual Block instance for each unique type in the dataframe.

In [3]:
# Retrieve the underlying BlockManager instance 
# and display it using the print() function.

print(moma._data)

BlockManager
Items: Index(['ExhibitionID', 'ExhibitionNumber', 'ExhibitionTitle',
       'ExhibitionCitationDate', 'ExhibitionBeginDate', 'ExhibitionEndDate',
       'ExhibitionSortOrder', 'ExhibitionURL', 'ExhibitionRole',
       'ConstituentID', 'ConstituentType', 'DisplayName', 'AlphaSort',
       'FirstName', 'MiddleName', 'LastName', 'Suffix', 'Institution',
       'Nationality', 'ConstituentBeginDate', 'ConstituentEndDate',
       'ArtistBio', 'Gender', 'VIAFID', 'WikidataID', 'ULANID',
       'ConstituentURL'],
      dtype='object')
Axis 1: RangeIndex(start=0, stop=34558, step=1)
FloatBlock: [0, 6, 9, 19, 20, 23, 25], 7 x 34558, dtype: float64
ObjectBlock: [1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 21, 22, 24, 26], 20 x 34558, dtype: object


## Different Types Have Different Memory Footprints

#### Float Columns
The `float64` type represents each floating point value using **64 bits, or 8 bytes**. There are 34,558 rows in the dataframe, which means that each `float64` column in our dataframe uses **276,464 bytes of memory (34558 rows times 8 bytes)**.<br>

Under the hood, pandas represents numeric values as NumPy ndarrays, and stores them in a contiguous block of memory. This storage model consumes less space and allows us to access the values themselves quickly. Because pandas represents each value of the same type using the same number of bytes, and a NumPy ndarray stores the number of values, pandas can return the number of bytes a numeric column consumes quickly and accurately.<br>

We can retrieve the amount of memory the values in a column consume using the `Series.nbytes` attribute.
* If you'd like, you can use the console to confirm that the values in a `float64` column from our dataframe use 276,464 bytes.
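
The arithmetic above can be checked without the MoMA file. The sketch below uses a synthetic `float64` series (the data and the variable names are made up for illustration) and compares `Series.nbytes` with the row count times 8 bytes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a float64 column; this is not the MoMA data.
n_rows = 1_000
values = pd.Series(np.linspace(0.0, 1.0, n_rows), dtype="float64")

# Each float64 value occupies a fixed number of bytes.
bytes_per_value = values.dtype.itemsize
expected_bytes = n_rows * bytes_per_value

print(bytes_per_value)   # bytes used by a single float64 value
print(values.nbytes)     # bytes used by all values in the series
print(expected_bytes)    # the same number, computed by hand

# The same arithmetic for a 34,558-row float64 column.
moma_rows = 34_558
moma_column_bytes = moma_rows * bytes_per_value
print(moma_column_bytes)
```

Because every value has the same width, pandas never needs to inspect individual values to report this number.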

#### Object Columns

The `object` type represents **string values**. It represents each value using Python string objects, partly due to the lack of support for missing string values in NumPy. **Because Python is a high-level, interpreted language, it doesn't have fine-grained control over how values in memory are stored**.<br>

This limitation causes Python to store a list of strings in a **fragmented way that consumes more memory and is slower to access**. Each element in a Python list is really a pointer that contains the "address" for the actual value's location in memory. Here's a diagram that visualizes the **difference between how NumPy and Python store an array of values**:

![difference-between-numpy-and-python-store-array](https://s3.amazonaws.com/dq-content/numpy_vs_python.png)

If you want to learn more about the differences, read the blog post [Why Python Is Slow](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/), which is where the diagram came from.<br>

While **each pointer takes up 8 bytes** of memory, **each actual string value uses a different amount of memory**.
* If you'd like, you can use the console to confirm that the pointers for the values in an `object` column from our dataframe use 276,464 bytes (34,558 values times 8 bytes). Use the `Series.nbytes` attribute the same way we did before.
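
To make the pointer-versus-string distinction concrete, here is a small sketch on made-up names (not the MoMA data). It assumes a 64-bit Python build, where an object pointer is 8 bytes, and uses `sys.getsizeof` from the standard library to measure each string object. The `dtype=object` argument is passed explicitly so the series is stored the way this section describes.

```python
import sys

import numpy as np
import pandas as pd

# Made-up strings of different lengths, stored as Python objects.
names = pd.Series(["Ada", "Grace Hopper", "Katherine Johnson"], dtype=object)

# nbytes only counts the pointers held in the underlying array.
pointer_size = np.dtype(object).itemsize
pointer_bytes = names.nbytes

# Each linked string object has its own size, which grows with its length.
string_bytes = [sys.getsizeof(s) for s in names]

print(pointer_size)    # size of one pointer
print(pointer_bytes)   # size of all pointers in the column
print(string_bytes)    # size of each string object the pointers refer to
```

The pointer total depends only on the number of rows, while the string sizes depend on the content, which is why pandas cannot know the full footprint without visiting every value.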

#### How pandas Estimates the Dataframe Memory Footprint

Because the NumPy array stores its own dimensions underneath and all of the values in a NumPy array have the same type, pandas can accurately calculate the memory footprint of numeric columns without having to look up each value.<br>

For `object` type columns, however, pandas only knows that each value consumes at least 8 bytes (for just the pointer) without manually inspecting the linked value. This means that, for its estimate, pandas counts each value in both a `float64` column and an `object` column as 8 bytes of memory.<br>

If you'll recall, a kilobyte is equivalent to 1,024 bytes (2^10), and a megabyte is equivalent to 1,048,576 bytes (2^20). With this in mind, we can calculate the estimated **shallow** memory footprint that the `DataFrame.info()` method returned.

* Recreate the estimate of the memory footprint by multiplying the number of values in the `moma` dataframe by 8. Assign this number to `total_bytes`.
  * Use the `DataFrame.size` attribute to return the number of values in a dataframe.
* Convert `total_bytes` from bytes to megabytes, and assign the result to `total_megabytes`.
* Display `total_bytes` and `total_megabytes` using the `print()` function.


In [4]:
moma.size

933066

In [5]:
total_bytes = 8*moma.size
total_megabytes = total_bytes / 2**20

print(total_bytes)
print(total_megabytes)

7464528
7.1187286377


## Calculating the True Memory Footprint

If you'll recall, the original memory footprint pandas returned was 7.1+ MB, which matches our result of 7.12 megabytes from the last step. To force pandas to inspect the memory for each linked string value and return the true memory footprint, we need to set the `memory_usage` parameter to `"deep"` when calling `DataFrame.info()`.

```python
>> moma.info(memory_usage="deep")
class 'pandas.core.frame.DataFrame'
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
ExhibitionID              34129 non-null float64
ExhibitionNumber          34558 non-null object
ExhibitionTitle           34558 non-null object
ExhibitionCitationDate    34557 non-null object
ExhibitionBeginDate       34558 non-null object
ExhibitionEndDate         33354 non-null object
ExhibitionSortOrder       34558 non-null float64
ExhibitionURL             34125 non-null object
ExhibitionRole            34424 non-null object
ConstituentID             34044 non-null float64
ConstituentType           34424 non-null object
DisplayName               34424 non-null object
AlphaSort                 34424 non-null object
FirstName                 31499 non-null object
MiddleName                3804 non-null object
LastName                  31998 non-null object
Suffix                    157 non-null object
Institution               2458 non-null object
Nationality               26072 non-null object
ConstituentBeginDate      25290 non-null float64
ConstituentEndDate        19819 non-null float64
ArtistBio                 26089 non-null object
Gender                    25796 non-null object
VIAFID                    26996 non-null float64
WikidataID                22241 non-null object
ULANID                    21688 non-null float64
ConstituentURL            34044 non-null object
dtypes: float64(7), object(20)
memory usage: 45.6 MB
```

**The true memory footprint of our dataframe is 45.6 megabytes**. This means that we'll require about 38.5 megabytes to store the actual Python strings for the `object` columns (45.6 - 7.1).<br>

Let's calculate the amount of memory just the `object` columns are consuming (both the pointers as well as the actual linked string values). We can use the `DataFrame.memory_usage()` method to return the amount of memory each column consumes. We need to set the `deep` parameter to `True` to display the deep memory footprint of each column:

```python
>> moma.memory_usage(deep=True)
Index                          80
ExhibitionID               276464
ExhibitionNumber          2085850
ExhibitionTitle           3333695
ExhibitionCitationDate    3577728
ExhibitionBeginDate       2281851
ExhibitionEndDate         2234872
ExhibitionSortOrder        276464
ExhibitionURL             3494606
ExhibitionRole            2179383
ConstituentID              276464
ConstituentType           2313112
DisplayName               2548428
AlphaSort                 2534329
FirstName                 2104929
MiddleName                1218953
LastName                  2162941
Suffix                    1110349
Institution               1221368
Nationality               1949664
ConstituentBeginDate       276464
ConstituentEndDate         276464
ArtistBio                 3183300
Gender                    1858994
VIAFID                     276464
WikidataID                1821293
ULANID                     276464
ConstituentURL            2677922
dtype: int64
```

* Select just the `object` columns from the `moma` dataframe and assign the resulting dataframe to `obj_cols`.
* Use the `DataFrame.memory_usage()` method and set the `deep` parameter to `True` to return the memory footprint of each column in obj_cols. Assign the resulting series to `obj_cols_mem`, and display it using a print statement.
* Use the `Series.sum()` method to sum the values in `obj_cols_mem`, convert the result to megabytes, and assign the result to `obj_cols_sum`. Display `obj_cols_sum` using a print statement.

In [10]:
obj_cols = moma.select_dtypes(include=['object'])
obj_cols_mem = obj_cols.memory_usage(deep=True)
print(obj_cols_mem)

Index                          80
ExhibitionNumber          2085850
ExhibitionTitle           3333695
ExhibitionCitationDate    3577728
ExhibitionBeginDate       2281851
ExhibitionEndDate         2234872
ExhibitionURL             3494606
ExhibitionRole            2179383
ConstituentType           2313112
DisplayName               2548428
AlphaSort                 2534329
FirstName                 2104929
MiddleName                1218953
LastName                  2162941
Suffix                    1110349
Institution               1221368
Nationality               1949664
ArtistBio                 3183300
Gender                    1858994
WikidataID                1821293
ConstituentURL            2677922
dtype: int64


In [11]:
obj_cols_sum = obj_cols_mem.sum()/2**20
print(obj_cols_sum)

43.7675924301


## Optimizing Integer Columns with Subtypes

Pandas uses 43.8 megabytes of the total 45.6 megabytes to represent the `object` columns. This means that we can achieve the greatest memory savings by converting `object` columns to numeric ones. Now that we understand how pandas represents two common data types in memory, let's learn more about the other types in pandas, their subtypes, and other ways we can reduce a dataframe's memory footprint.<br>

Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the `float` type has the `float16`, `float32`, `float64`, and `float128` subtypes. The number portion of a type's name indicates the number of bits that type uses to represent values. For example, the subtypes we just listed use `2`, `4`, `8` and `16` bytes, respectively. The following table shows the subtypes for the most common pandas types:

object|bool|float|int|datetime
---|---|---|---|---
object|bool|float16|int8|datetime64
 | |float32|int16| 
 | |float64|int32| 
 | |float128|int64|
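
The relationship between a subtype's name and its size can be read directly from NumPy. This is a minimal sketch; `float128` is left out on purpose because it is not available on every platform.

```python
import numpy as np

subtypes = ["float16", "float32", "float64", "int8", "int16", "int32", "int64"]

# itemsize is the number of bytes used for a single value of that subtype.
bytes_per_value = {name: np.dtype(name).itemsize for name in subtypes}

for name, size in bytes_per_value.items():
    print(f"{name}: {size * 8} bits = {size} bytes")
```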
 
An `int8` value uses `1` byte (or `8` bits) to store a value, and can represent `256` values (`2^8`) in binary. This means that we can use this subtype to represent values ranging from `-128` to `127` (including `0`). We can use the `numpy.iinfo` class to verify the minimum and maximum values for each integer subtype:

```python
>> import numpy as np
>> int_types = ["int8", "int16", "int32", "int64"]
>> for it in int_types:
..     print(np.iinfo(it))
Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------

Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
```

Using `numpy.iinfo()` returns a `numpy.core.getlimits.iinfo` object. We can access the values we're interested in using its `min` and `max` attributes:

```python
>> np.iinfo("int8").min
-128
>> np.iinfo("int8").max
127
```

We can use these functions to calculate the subtype that requires the least memory to represent all of the values in a given numeric column. We can save memory by converting within the same type (from `float64` to `float32` for example), or by converting between types (from `float64` to `int32`). Note that converting from `float64` to `int64` won't save any memory because values of both types are represented using `8` bytes.<br>

Finally, we have to represent missing values in numeric columns using a `float` subtype because the NumPy `int` type doesn't have a missing value object (like `NaN` for `float` values). Trying to convert a `float` column that contains missing values to an `int` subtype will generate an error.
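
The following sketch shows the restriction on two small made-up series. It assumes a pandas version in which the failed cast raises `ValueError` or one of its subclasses.

```python
import numpy as np
import pandas as pd

# A float column with a missing value cannot be cast to an int subtype.
with_missing = pd.Series([1.0, 2.0, np.nan])

try:
    with_missing.astype("int16")
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print("astype failed:", err)

# Without missing values, the same cast succeeds.
without_missing = pd.Series([1.0, 2.0, 1768.0])
as_int = without_missing.astype("int16")
print(as_int.dtype)
```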

The `ExhibitionSortOrder` column is the only numeric column that doesn't contain any missing values:

```python
>> moma.select_dtypes(include=['float']).isnull().sum()
ExhibitionID              429
ExhibitionSortOrder         0
ConstituentID             514
ConstituentBeginDate     9268
ConstituentEndDate      14739
VIAFID                   7562
ULANID                  12870
dtype: int64
```

Let's find the `int` subtype that uses the smallest number of bytes to represent each value in this column.

* Find the smallest `int` subtype that can accommodate the values in the `ExhibitionSortOrder` column. Use the `Series.astype()` method to set the type, and assign it back to the `moma` dataframe.
  * If the column's maximum value is no greater than the `int8` maximum value and its minimum value is no less than the `int8` minimum value, set the column's type to `int8`.
  * Else, if the column's maximum value is no greater than the `int16` maximum value and its minimum value is no less than the `int16` minimum value, set the column's type to `int16`.
  * Else, if the column's maximum value is no greater than the `int32` maximum value and its minimum value is no less than the `int32` minimum value, set the column's type to `int32`.
  * Else, if the column's maximum value is no greater than the `int64` maximum value and its minimum value is no less than the `int64` minimum value, set the column's type to `int64`.
* Display the column's type and its memory usage.

In [12]:
import numpy as np

In [15]:
print(moma['ExhibitionSortOrder'].min())
print(moma['ExhibitionSortOrder'].max())

1.0
1768.0


In [34]:
print(np.iinfo("int8").min)
print(np.iinfo("int8").max)

-128
127


In [35]:
print(np.iinfo("int16").min)
print(np.iinfo("int16").max)

-32768
32767


In [22]:
moma['ExhibitionSortOrder'] = moma['ExhibitionSortOrder'].astype('int16')

In [33]:
moma['ExhibitionSortOrder'].dtype

dtype('int16')

In [32]:
moma['ExhibitionSortOrder'].memory_usage(deep=True)

69196

In [45]:
# a function to optimize column by its size

def opt_column_size(df, column):
    
    col_min = df[column].min()
    col_max = df[column].max()
    
    int_types = ['int8', 'int16', 'int32', 'int64']
    iinfo_int_minmax = [(np.iinfo(it).min, np.iinfo(it).max) 
                        for it in int_types]
    
    for i, minmax in enumerate(iinfo_int_minmax):
        
        if col_min >= minmax[0] and col_max <= minmax[1]:
            df[column] = df[column].astype(int_types[i])
            
            print(column, 'optimized to:', int_types[i])
            print('memory usage:', df[column].memory_usage(deep=True))
            print('datatype:', df[column].dtype)

            return 'Success'

    return 'Fail'


opt_column_size(moma, 'ExhibitionSortOrder')

ExhibitionSortOrder optimized to: int16
memory usage: 69196
datatype: int16


'Success'

## Optimizing Float Columns With Subtypes

The optimal subtype for the `ExhibitionSortOrder` column is `int16`, which represents each value in the column using 2 bytes. Along with the index, which consumes 80 bytes of memory, the total memory the column consumes is 69196 bytes:

```python
>> value_bytes = len(moma) * 2 + 80
>> value_bytes == moma['ExhibitionSortOrder'].memory_usage(deep=True)
True
```

While we can use the `numpy.finfo` class along with the multiple if statement strategy to find the most space efficient `float` subtype for a column, this is a tedious process. At the same time, if we don't specify a `float` subtype when we use `Series.astype()` to convert a column, the function will use `float64` by default. If we try to convert a column to an `int` subtype without specifying a specific subtype, the default of `int64` will be used.

```python
# Reset the dataframe to the original CSV
>> moma = pd.read_csv("moma.csv")
>> moma['ExhibitionSortOrder'] = moma['ExhibitionSortOrder'].astype('int')
>> moma['ExhibitionSortOrder'].dtype
dtype('int64')
```
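
For reference, `numpy.finfo` exposes limits for the `float` subtypes in the same way `numpy.iinfo` does for integers. The sketch below collects the maximum value and the approximate number of reliable decimal digits for each subtype; the `float_limits` dictionary is just an illustrative name.

```python
import numpy as np

float_limits = {}
for float_type in [np.float16, np.float32, np.float64]:
    info = np.finfo(float_type)
    name = np.dtype(float_type).name
    # precision is the approximate number of decimal digits the type can hold.
    float_limits[name] = {"max": float(info.max), "precision": info.precision}

for name, limits in float_limits.items():
    print(name, limits)
```

Unlike integer subtypes, a smaller `float` subtype reduces precision as well as range, so checking only the minimum and maximum values is not enough to decide whether a downcast is safe.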

To help find the most space efficient type for a column, we can use the `pandas.to_numeric()` function. First, we need to convert to the general dtype, then use the `downcast` parameter when calling this function to ask pandas to find the optimal subtype:

```python
# Reset the dataframe to original CSV
>> moma = pd.read_csv("moma.csv")
>> moma['ExhibitionSortOrder'] = moma['ExhibitionSortOrder'].astype('int')
>> moma['ExhibitionSortOrder'] = pd.to_numeric(moma['ExhibitionSortOrder'], downcast='integer')
>> moma['ExhibitionSortOrder'].dtype
dtype('int16')
```

Note that we have to pass the string `"integer"` into the `downcast` parameter, not `"int"`. This technique will only work if the column is already recognized as a numeric type.

```python
# Reset the dataframe to original CSV
>> moma = pd.read_csv("moma.csv")
>> moma['ExhibitionSortOrder'] = pd.to_numeric(moma['ExhibitionSortOrder'], downcast='integer')
>> moma['ExhibitionSortOrder'].dtype
dtype('int64')
```
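
The `downcast` parameter can also be tried on small synthetic series, which avoids reloading the CSV. The values below are made up; the largest one mirrors the maximum of `ExhibitionSortOrder` seen earlier.

```python
import pandas as pd

# Integer values whose range needs more than one byte.
sort_order = pd.Series([1, 250, 1768], dtype="int64")

# "integer" picks the smallest signed subtype that fits every value.
signed = pd.to_numeric(sort_order, downcast="integer")

# "unsigned" does the same among the unsigned subtypes.
unsigned = pd.to_numeric(sort_order, downcast="unsigned")

# A narrower range fits into a single byte.
tiny = pd.to_numeric(pd.Series([1, 2, 3], dtype="int64"), downcast="integer")

print(signed.dtype, unsigned.dtype, tiny.dtype)
```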


* Convert the remaining `float` columns to the most space efficient `float` subtype.
* Select the `float` columns again, and display their dtypes using the `DataFrame.dtypes` attribute.

In [48]:
float_cols = moma.select_dtypes(include=['float'])
float_cols.dtypes

ExhibitionID            float64
ConstituentID           float64
ConstituentBeginDate    float64
ConstituentEndDate      float64
VIAFID                  float64
ULANID                  float64
dtype: object

In [50]:
for fc in  float_cols.columns:
    moma[fc] = pd.to_numeric(moma[fc], downcast='float')
    print(fc, moma[fc].dtype)

ExhibitionID float32
ConstituentID float32
ConstituentBeginDate float32
ConstituentEndDate float32
VIAFID float32
ULANID float32
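
Before accepting a `float32` downcast, it is worth checking whether the values survive it. The sketch below uses `Series.astype()` directly on three values taken from the preview at the top of this mission (a `VIAFID`, another `VIAFID`, and a birth year) and flags the ones that change. Converting with `astype()` rather than `pandas.to_numeric()` is a choice made here so that the result does not depend on the pandas version.

```python
import pandas as pd

ids = pd.Series([109252853.0, 39374836.0, 1902.0], dtype="float64")

# Force the smaller subtype, then compare against the original values.
ids32 = ids.astype("float32")
changed = ids != ids32.astype("float64")

print(ids32.nbytes, "bytes instead of", ids.nbytes)
print(changed.tolist())
print(int(ids32.iloc[0]))   # the first ID after rounding to float32
```

Small values such as years are unaffected, but large identifiers can be rounded to a neighbouring number, which matters when the column is used as a key.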


## Converting To DateTime

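The output shows that pandas downcast all six `float` columns to `float32`, which halves the memory each one uses. The saving has a cost, though: `float32` only holds about seven significant decimal digits, so large identifiers get rounded. In the preview below, the first `VIAFID` value has changed from `109252853.0` to `109252856.0`, so it may be safer to leave ID columns such as `VIAFID` and `ULANID` as `float64`. Let's move on to another dtype, `datetime`. If you take a look at the object columns in the dataframe, you'll notice that we can convert both the `ExhibitionBeginDate` and the `ExhibitionEndDate` columns to the `datetime64` subtype to save space.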

In [53]:
moma.head(2)

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ConstituentID,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,http://www.moma.org/calendar/exhibitions/1767,Director,9168.0,...,,American,1902.0,1981.0,"American, 1902–1981",Male,109252856.0,Q711362,500241568.0,moma.org/artists/9168
1,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,http://www.moma.org/calendar/exhibitions/1767,Artist,1053.0,...,,French,1839.0,1906.0,"French, 1839–1906",Male,39374836.0,Q35548,500004800.0,moma.org/artists/1053


While the `ExhibitionEndDate` column contains missing values, the `datetime` type supports missing values using the `NaT` object (similar to `NaN` for float).

```python
>> moma["ExhibitionEndDate"].isnull().sum()
1204
```

We can use the `pandas.to_datetime()` function to convert a column to the datetime type. When we pass it a series, this function returns a series that we can assign back to the dataframe.


* Convert the `ExhibitionBeginDate` and `ExhibitionEndDate` columns to the `datetime` type, and assign the results back to the `moma` dataframe.
* Display the memory usage for both of these columns using the `DataFrame.memory_usage()` method.

In [54]:
moma['ExhibitionBeginDate'] = pd.to_datetime(moma['ExhibitionBeginDate'])
moma['ExhibitionEndDate'] = pd.to_datetime(moma['ExhibitionEndDate'])

moma[['ExhibitionBeginDate', 'ExhibitionEndDate']].memory_usage(deep=True)

Index                      80
ExhibitionBeginDate    276464
ExhibitionEndDate      276464
dtype: int64
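
The same conversion can be sketched on a tiny made-up series that includes a missing value. Passing `format="%m/%d/%Y"` is an assumption about the month/day/year layout seen in the preview; it makes the parsing explicit rather than inferred.

```python
import pandas as pd

# Two date strings in month/day/year order plus one missing value.
raw = pd.Series(["11/7/1929", "12/7/1929", None], dtype=object)

dates = pd.to_datetime(raw, format="%m/%d/%Y")

print(dates.dtype)            # a datetime64 dtype
print(dates.isna().tolist())  # the missing value becomes NaT
print(dates.nbytes)           # 8 bytes per value, missing or not
```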

## Converting to Categorical to Save Memory

Pandas introduced [Categoricals](http://pandas.pydata.org/pandas-docs/stable/categorical.html) in version 0.15. The `category` type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas uses a separate mapping dictionary that maps the integer values to the raw ones. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the `category` dtype, pandas uses the most space efficient int subtype that can represent all of the unique values in a column.<br>

The `ConstituentType` column only has two unique values, for example. Converting it to the categorical type would save a lot of space.

```python
>> moma['ConstituentType'].memory_usage(deep=True)
2313192
>> print(moma['ConstituentType'].value_counts())
Individual     32008
Institution     2416
Name: ConstituentType, dtype: int64
>> moma['ConstituentType'] = moma['ConstituentType'].astype('category')
>> moma['ConstituentType'].memory_usage(deep=True)
34773
```

For most dataframe operations, pandas hides the integer representation from us and returns the raw data values (unless we specifically ask for the integers).


```python
>> moma['ConstituentType']
0        Individual
1        Individual
2        Individual
3        Individual
4        Individual
5        Individual
...
```


In the following code, we use the `Series.cat.codes` attribute to return the integer values the category type uses to represent each value.


```python
>> moma['ConstituentType'].cat.codes
0        0
1        0
2        0
3        0
4        0
5        0
...
```
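
To see the mapping between codes and raw values in one place, here is a sketch on a four-row made-up series. `Series.cat.categories` holds the unique raw values, and each code is a position in that list; a missing value gets the code `-1`.

```python
import pandas as pd

kinds = pd.Series(
    ["Individual", "Institution", "Individual", None], dtype=object
).astype("category")

categories = kinds.cat.categories   # the unique raw values
codes = kinds.cat.codes             # one small integer per row

print(list(categories))
print(codes.tolist())
print(codes.dtype)
```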

The `category` subtype handles missing values by assigning them the code `-1`. Thanks to the flexibility of the `category` dtype, **we can drastically reduce a dataframe's memory footprint by converting all of the columns to this subtype**.

```python
>> moma = pd.read_csv("moma.csv")
>> for col in moma.columns:
..    moma[col] = moma[col].astype('category')
>> moma.info(memory_usage='deep')
class 'pandas.core.frame.DataFrame'
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
ExhibitionID              34129 non-null category
ExhibitionNumber          34558 non-null category
ExhibitionTitle           34558 non-null category
ExhibitionCitationDate    34557 non-null category
ExhibitionBeginDate       34558 non-null category
ExhibitionEndDate         33354 non-null category
ExhibitionSortOrder       34558 non-null category
ExhibitionURL             34125 non-null category
ExhibitionRole            34424 non-null category
ConstituentID             34044 non-null category
ConstituentType           34424 non-null category
DisplayName               34424 non-null category
AlphaSort                 34424 non-null category
FirstName                 31499 non-null category
MiddleName                3804 non-null category
LastName                  31998 non-null category
Suffix                    157 non-null category
Institution               2458 non-null category
Nationality               26072 non-null category
ConstituentBeginDate      25290 non-null category
ConstituentEndDate        19819 non-null category
ArtistBio                 26089 non-null category
Gender                    25796 non-null category
VIAFID                    26996 non-null category
WikidataID                22241 non-null category
ULANID                    21688 non-null category
ConstituentURL            34044 non-null category
dtypes: category(27)
memory usage: 6.4 MB
```

Converting each column to the `category` dtype reduced the memory footprint to just 6.4 MB. While converting all of the columns to this type is appealing, it's important to be aware of the **trade-offs**. The biggest one is the inability to perform numerical computations: we can't do arithmetic with category columns or use methods like `Series.min()` and `Series.max()` without converting to a true numeric dtype first.<br>
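
A short sketch of this limitation, using three made-up years: both an addition and a call to `Series.max()` are attempted on a category column, and the resulting errors are collected. It assumes the default unordered category produced by `astype('category')`.

```python
import pandas as pd

years = pd.Series([1902, 1839, 1848], dtype="int64")
as_cat = years.astype("category")

# Try a numerical operation and a reduction on the category column.
errors = {}
attempts = [("add", lambda s: s + 1), ("max", lambda s: s.max())]
for label, operation in attempts:
    try:
        operation(as_cat)
    except TypeError as err:
        errors[label] = str(err)

print(sorted(errors))

# Converting back to a numeric dtype restores both operations.
restored_max = as_cat.astype("int64").max()
print(restored_max)
```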

We should stick to using the `category` type primarily for `object` columns where fewer than 50% of the values are unique. If all of the values in a column are unique, the `category` type will end up using more memory. That's because the column is storing all of the raw string values in addition to the integer category codes. You can read more about the limitations of the `category` type in the [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/categorical.html#gotchas).
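
The effect of uniqueness on memory can be checked with a small experiment. The sketch below builds two made-up columns (`unique_col` and `repeated_col` are hypothetical names) and compares their deep memory usage as `object` and as `category`. On pandas versions that store strings as Python objects, the all-unique column should come out larger as a category, while the repeated column should shrink considerably.

```python
import pandas as pd

n = 10_000

# Hypothetical columns: every value unique vs. two values repeated
unique_col = pd.Series([f"artist_{i}" for i in range(n)], dtype="object")
repeated_col = pd.Series(["Individual", "Institution"] * (n // 2), dtype="object")


def deep_bytes(series):
    """Return the deep memory usage of a series in bytes."""
    return series.memory_usage(deep=True)


for name, col in [("unique", unique_col), ("repeated", repeated_col)]:
    as_object = deep_bytes(col)
    as_category = deep_bytes(col.astype("category"))
    print(f"{name}: object={as_object} bytes, category={as_category} bytes")
```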

* Convert all `object` columns where less than half of the column's values are unique to the `category` dtype.
* Return the deep memory footprint using the `DataFrame.info()` method.

In [63]:
moma.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
ExhibitionID              34129 non-null float32
ExhibitionNumber          34558 non-null object
ExhibitionTitle           34558 non-null object
ExhibitionCitationDate    34557 non-null object
ExhibitionBeginDate       34558 non-null datetime64[ns]
ExhibitionEndDate         33354 non-null datetime64[ns]
ExhibitionSortOrder       34558 non-null int16
ExhibitionURL             34125 non-null object
ExhibitionRole            34424 non-null object
ConstituentID             34044 non-null float32
ConstituentType           34424 non-null object
DisplayName               34424 non-null object
AlphaSort                 34424 non-null object
FirstName                 31499 non-null object
MiddleName                3804 non-null object
LastName                  31998 non-null object
Suffix                    157 non-null object
Institution               2458 non-null object
Nationality   

In [68]:
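obj_cols = moma.select_dtypes(include=['object'])

# Track converted columns; initialize the list once, outside the loop
changed = []
for oc in obj_cols.columns:
    # Convert only when fewer than half of the values are unique
    if len(moma[oc]) * 0.5 > len(moma[oc].unique()):
        moma[oc] = moma[oc].astype('category')
        changed.append(oc)

print('The following columns have been converted to the category dtype:')
print(changed)

The following columns have been converted to the category dtype:
['ExhibitionNumber', 'ExhibitionTitle', ..., 'ConstituentURL']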


The deep memory usage decreased from 40.8 MB to 8.9 MB:

In [70]:
moma.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
ExhibitionID              34129 non-null float32
ExhibitionNumber          34558 non-null category
ExhibitionTitle           34558 non-null category
ExhibitionCitationDate    34557 non-null category
ExhibitionBeginDate       34558 non-null datetime64[ns]
ExhibitionEndDate         33354 non-null datetime64[ns]
ExhibitionSortOrder       34558 non-null int16
ExhibitionURL             34125 non-null category
ExhibitionRole            34424 non-null category
ConstituentID             34044 non-null float32
ConstituentType           34424 non-null category
DisplayName               34424 non-null category
AlphaSort                 34424 non-null category
FirstName                 31499 non-null category
MiddleName                3804 non-null category
LastName                  31998 non-null category
Suffix                    157 non-null category
Institution               2458 non-nu

## Selecting Types While Reading the Data In

So far, we've explored ways to reduce the memory footprint of an **existing** dataframe. By reading the dataframe in first and then iterating on ways to save memory, we were able to better understand how much memory we can expect to save from each optimization. As we mentioned earlier in the mission, however, we often won't have enough memory to represent all the values in a data set. How can we apply memory-saving techniques when we can't even create the dataframe in the first place?

Fortunately, we can specify the optimal column types when we read the data set in. The [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function has **a few different parameters** that allow us to do this. The `dtype` parameter accepts a dictionary that has (string) column names as the keys and NumPy type objects as the values.

```python
import numpy as np
import pandas as pd

# Map column names to the NumPy types they should be read as
col_types = {"id": np.int32}
df = pd.read_csv('data.csv', dtype=col_types)
```

The `parse_dates` parameter accepts a list of strings containing the names of the columns we want to parse as `datetime` values.

```python
df = pd.read_csv('data.csv', parse_dates=["StartDate", "EndDate"])
```

Finally, we can use the `usecols` parameter to specify which columns we want to include. Many data sets have redundant columns, or ones that aren't useful for analysis. Leaving them out entirely can save a lot of memory. This parameter accepts a list of string values or integer indexes.

```python
df = pd.read_csv('data.csv', usecols=["StartDate", "EndDate"])
```
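
The three parameters can also be combined in a single call. The following is a minimal sketch that reads a small in-memory CSV instead of a file on disk; the CSV contents and the `Kind` and `Notes` column names are made up for illustration. It also assumes that `dtype` accepts the string `"category"` in addition to NumPy types, which lets us create categorical columns directly at read time.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV contents standing in for a file on disk
csv_text = """id,StartDate,EndDate,Kind,Notes
1,1929-11-07,1929-12-07,Individual,first
2,1930-01-18,1930-02-16,Institution,second
3,1930-03-02,1930-04-02,Individual,third
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "StartDate", "EndDate", "Kind"],  # leave out the Notes column
    dtype={"id": np.int32, "Kind": "category"},      # choose compact types up front
    parse_dates=["StartDate", "EndDate"],            # parse these columns as datetimes
)

print(df.dtypes)
```

Because the types are chosen while parsing, the larger default representation of these columns never has to be held in memory.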

Read `"moma.csv"` into a dataframe named moma:
* Set the `ExhibitionBeginDate` and `ExhibitionEndDate` columns to the `datetime` type.
* Include only these columns:
  * ExhibitionID
  * ExhibitionNumber
  * ExhibitionBeginDate
  * ExhibitionEndDate
  * ExhibitionSortOrder
  * ExhibitionRole
  * ConstituentType
  * DisplayName
  * Institution
  * Nationality
  * Gender
* Display the deep memory footprint in megabytes.

In [72]:
keep_cols = ['ExhibitionID', 'ExhibitionNumber', 'ExhibitionBeginDate', 'ExhibitionEndDate', 'ExhibitionSortOrder', 'ExhibitionRole', 'ConstituentType', 'DisplayName', 'Institution', 'Nationality', 'Gender']
to_parse_dates = ['ExhibitionBeginDate', 'ExhibitionEndDate']

In [73]:
moma_defaultset = pd.read_csv('../data/moma.csv')
moma_selected = pd.read_csv('../data/moma.csv',
                            parse_dates=to_parse_dates,
                            usecols=keep_cols)

In [74]:
moma_defaultset.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
ExhibitionID              34129 non-null float64
ExhibitionNumber          34558 non-null object
ExhibitionTitle           34558 non-null object
ExhibitionCitationDate    34557 non-null object
ExhibitionBeginDate       34558 non-null object
ExhibitionEndDate         33354 non-null object
ExhibitionSortOrder       34558 non-null float64
ExhibitionURL             34125 non-null object
ExhibitionRole            34424 non-null object
ConstituentID             34044 non-null float64
ConstituentType           34424 non-null object
DisplayName               34424 non-null object
AlphaSort                 34424 non-null object
FirstName                 31499 non-null object
MiddleName                3804 non-null object
LastName                  31998 non-null object
Suffix                    157 non-null object
Institution               2458 non-null object
Nationality               26

In [75]:
moma_selected.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 11 columns):
ExhibitionID           34129 non-null float64
ExhibitionNumber       34558 non-null object
ExhibitionBeginDate    34558 non-null datetime64[ns]
ExhibitionEndDate      33354 non-null datetime64[ns]
ExhibitionSortOrder    34558 non-null float64
ExhibitionRole         34424 non-null object
ConstituentType        34424 non-null object
DisplayName            34424 non-null object
Institution            2458 non-null object
Nationality            26072 non-null object
Gender                 25796 non-null object
dtypes: datetime64[ns](2), float64(2), object(7)
memory usage: 14.6 MB


Reading only the selected columns and parsing the dates reduced the deep memory usage from 45.6 MB (default `moma`) to 14.6 MB (selected `moma`).
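
To get the deep memory footprint as a number rather than reading it off the `DataFrame.info()` output, one option is a small helper like the sketch below. The name `deep_memory_mb` and the example dataframe are hypothetical; the helper divides by 1024 squared, on the assumption that this matches the units `DataFrame.info()` reports.

```python
import pandas as pd


def deep_memory_mb(df):
    """Return the deep memory footprint of a dataframe in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Hypothetical dataframe for illustration only
example = pd.DataFrame({
    "ExhibitionRole": ["Artist", "Curator", "Artist"],
    "ExhibitionSortOrder": [1.0, 2.0, 3.0],
})

print(f"{deep_memory_mb(example):.4f} MB")
```

Calling the helper on two versions of the same data, such as `moma_defaultset` and `moma_selected`, would give the two figures to compare.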

## Next Steps
In this mission, we learned how pandas represents values in a data set under the hood, and how to reduce a dataframe's memory footprint. In the next mission, we'll explore how to process dataframes in chunks.