https://www.dataquest.io/blog/pandas-big-data/

In [2]:
import pandas as pd

Data is on S3: mytaxi-datascience-passenger-destination/data/processed/sorted_passenger_tours_processed.csv.zip

In [None]:
df = pd.read_csv('/Users/caiomiyashiro/repo/passenger_destination/data/raw/last_16_passenger_tours_v2.csv', index_col=0)

How much space does this dataframe use? (17.2 GB)

In [None]:
df.info(memory_usage='deep') 

# What is the Internal Representation of a Dataframe?
  
Under the hood, a pandas DataFrame groups columns of the same dtype into blocks. Each numeric block is stored as a NumPy array, which is a thin wrapper around a contiguous C array, and that makes the values efficient to access and process.

The same isn't true for object blocks, which hold anything that isn't numeric (typically strings). They store references to ordinary Python objects, so they are subject to all the usual Python slowness (source: [Why Python is Slow?](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/)).

<img src="img1.png" width="800">
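
To make the difference concrete, here is a minimal sketch on synthetic data (the `demo` frame and its column names are made up for illustration, they are not part of the taxi dataset). It stores the same numbers once as `int64` and once as Python strings in an object column, then compares the deep memory usage of each.

```python
import numpy as np
import pandas as pd

n = 100_000

# Hypothetical demo frame: the same values as a NumPy-backed int64 column
# and as an object column holding one Python str per row.
demo = pd.DataFrame({
    "as_int": np.arange(n, dtype="int64"),
    "as_str": pd.Series([str(i) for i in range(n)], dtype="object"),
})

# deep=True also counts the Python string objects behind the object column.
usage = demo.memory_usage(index=False, deep=True)
print(usage)
print("object / int64 ratio: {:.1f}".format(usage["as_str"] / usage["as_int"]))
```

The `int64` column costs exactly 8 bytes per row, while the object column pays for a pointer plus a full Python string object per row.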

Let's see the average memory usage per column for each dtype in our dataframe:

In [None]:
def average_memory_use(df, dtypes=['float','int','object']):
    for dtype in dtypes:
        selected_dtype = df.select_dtypes(include=[dtype])
        mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
        mean_usage_mb = mean_usage_b / 1024 ** 2
        print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))
        
average_memory_use(df)        

# Some stuff we can do to decrease the data size:

## Defining function to extract the column's size

In [None]:
def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

original_df_size = mem_usage(df)
print('Original data frame memory size: ' + str(original_df_size))

## pd.to_numeric() to downcast numeric variables - Floats or Integers

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html

In [None]:
df_float = df.select_dtypes(include=['float'])
df_float_converted = df_float.apply(pd.to_numeric,downcast='float')

print('Memory used by all the float64:')
print(mem_usage(df_float))
print('\n')
print('New memory size used by all the float32:')
print(mem_usage(df_float_converted))
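
As a small, self-contained illustration of what `downcast` does, the sketch below uses a few made-up values in the style of the coordinates and street numbers above. It also measures the rounding error introduced by going from `float64` to `float32`, which is worth checking before downcasting real coordinates.

```python
import pandas as pd

# Made-up sample values, not read from the real dataset.
floats = pd.Series([13.43457, 52.52301, 11.5783], dtype="float64")
ints = pd.Series([0, 8, 55, 105], dtype="int64")

# downcast picks the smallest dtype of the requested kind that fits the data.
floats_small = pd.to_numeric(floats, downcast="float")
ints_small = pd.to_numeric(ints, downcast="unsigned")

print(floats_small.dtype, ints_small.dtype)

# float32 keeps fewer significant digits, so inspect the worst-case error.
max_error = (floats - floats_small.astype("float64")).abs().max()
print("largest rounding error:", max_error)
```

Integer columns can be shrunk the same way with `downcast='integer'` or `downcast='unsigned'`, which the cell above does not do.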

# Why do string objects take so much space?
  
  
Every string in the dataframe is stored repeatedly in the computer's memory, *i.e.* if I have a column of 1 million rows with the same string value in each row, Python will store 1 million strings in memory, even though it could store the value once and keep just a reference to that place in memory.
  

<img src="numpy_vs_python.png" width="600">
source: https://www.dataquest.io/blog/pandas-big-data/  

## Converting string objects to category

[The category dtype](https://pandas.pydata.org/pandas-docs/stable/categorical.html) does exactly the above: it creates a numeric code for each distinct category value and stores only those codes in the array, each one referring to its value. This is the **main** memory saving, **if** we have a limited number of distinct category values, of course. If we convert an ID column, the conversion will actually increase memory usage, because pandas has to create the numeric mapping and still keep every distinct value in memory.  
  
A (totally rough) rule of thumb is to convert a variable to category if the number of distinct values is <= 50% of the number of rows. That way we are more likely to end up saving memory :)
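
The effect of cardinality can be checked on synthetic data. The sketch below is only an illustration under assumed inputs: `streets` repeats three street names, `ids` holds a unique string per row, and `deep_bytes` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

n = 90_000

# Low cardinality: 3 distinct values repeated over all rows.
streets = pd.Series(
    ["Landsberger Allee", "Steindamm", "Perlacher Straße"] * (n // 3),
    dtype="object",
)
# High cardinality: every row is unique, like an id column.
ids = pd.Series(["id_{:08d}".format(i) for i in range(n)], dtype="object")


def deep_bytes(series):
    """Deep memory usage of a Series in bytes, ignoring the index."""
    return series.memory_usage(index=False, deep=True)


for name, col in [("streets", streets), ("ids", ids)]:
    before = deep_bytes(col)
    after = deep_bytes(col.astype("category"))
    print("{}: {} -> {} bytes ({:.2f}x)".format(name, before, after, after / before))
```

With object-backed strings, the low-cardinality column should shrink to a small fraction of its size, while the all-unique column gains nothing because every distinct value still has to be kept next to the codes.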

In [None]:
df.select_dtypes(include=['object']).columns

In [None]:
df_object = df.select_dtypes(include=['object']).copy()
# we don't want to convert date_created to category, we want to keep it as datetime
df_object.drop('date_created', axis=1,inplace=True) 

df_object_converted = df_object.apply(lambda col: col.astype('category'))

print('Memory used by all the object dtypes:')
print(mem_usage(df_object))
print('\n')
print('New memory size used by all the category dtypes:')
print(mem_usage(df_object_converted))

In [9]:
1134.64/14973.74

0.0757753240005503

That is a memory reduction of more than 90%!
  
Below we can see that the transformation happens only under the hood. On the surface, the data looks exactly the same.

In [None]:
df_object.head()

In [None]:
df_object_converted.head()

# Joining Every Transformation

Let's take the original dataset, apply the transformations, and compare the old total size with the new one:

In [None]:
df[df_float_converted.columns] = df_float_converted
df[df_object_converted.columns] = df_object_converted

new_df_size = mem_usage(df)

print('Original data frame memory size: ' + str(original_df_size))
print('New data frame memory size: ' + str(new_df_size))

# During Projects

If the dataset is already too big, we can't read the whole file first and only then decrease its size. The approach I took was to read a limited number of lines and process them in order to identify the object dtypes, as they are the least efficient in terms of memory usage. For that we use the **nrows** parameter and then loop over the columns to create a dictionary with the columns' names and their dtypes:

In [13]:
# Read only a sample to detect data types
df = pd.read_csv('/Users/caiomiyashiro/repo/passenger_destination/data/raw/last_16_passenger_tours_v2.csv', index_col=0, nrows=10000)
print(df.shape)
display(df.head())

(10000, 60)


Unnamed: 0,id_passenger,date_created,id,request_long,request_lat,request_street,request_street_number,dest_long,dest_lat,dest_street,...,lag10_street,lag10_street_number,home_dest_long,home_dest_lat,home_street_name,home_street_number,work_dest_long,work_dest_lat,work_street_name,work_street_number
0,5529,2017-01-03 23:34:08.724,34667961,13.43457,52.52301,Landsberger Allee,4.0,13.38814,52.51476,Französische Straße,...,,,11.5783,48.11313,Perlacher Straße,8,13.387784,52.514698,Französische Straße,55
1,5529,2017-01-19 07:00:10.124,35328512,13.38778,52.5147,Französische Straße,55.0,13.29529,52.55517,,...,,,11.5783,48.11313,Perlacher Straße,8,13.387784,52.514698,Französische Straße,55
2,5529,2017-04-02 17:53:14.823,39716974,11.5783,48.11313,Perlacher Straße,8.0,11.78946,48.35712,Terminalstraße Ost,...,,,11.5783,48.11313,Perlacher Straße,8,13.387784,52.514698,Französische Straße,55
3,5529,2017-08-12 13:41:55.805,53875521,9.96889,53.55451,Heiligengeistfeld,,9.98529,53.54142,Platz der Deutschen Einheit,...,,,11.5783,48.11313,Perlacher Straße,8,13.387784,52.514698,Französische Straße,55
4,5529,2017-09-04 16:23:47.750,56454276,10.02013,53.5571,Steindamm,105.0,9.96422,53.55922,Neuer Pferdemarkt,...,,,11.5783,48.11313,Perlacher Straße,8,13.387784,52.514698,Französische Straße,55


In [14]:
dtypes_df = {}
obj_cols = df.select_dtypes(include=['object']).columns[1:].tolist() # [1:] is just to skip the 'date_created'
for obj_col in obj_cols:
    if(len(df[obj_col].drop_duplicates()) < df.shape[0]/2):
        dtypes_df[obj_col] = 'category'
dtypes_df        

{'dest_street': 'category',
 'dest_street_number': 'category',
 'home_street_name': 'category',
 'home_street_number': 'category',
 'lag10_street': 'category',
 'lag10_street_number': 'category',
 'lag1_street': 'category',
 'lag1_street_number': 'category',
 'lag2_street': 'category',
 'lag2_street_number': 'category',
 'lag3_street': 'category',
 'lag3_street_number': 'category',
 'lag4_street': 'category',
 'lag4_street_number': 'category',
 'lag5_street': 'category',
 'lag5_street_number': 'category',
 'lag6_street': 'category',
 'lag6_street_number': 'category',
 'lag7_street': 'category',
 'lag7_street_number': 'category',
 'lag8_street': 'category',
 'lag8_street_number': 'category',
 'lag9_street': 'category',
 'lag9_street_number': 'category',
 'request_street': 'category',
 'request_street_number': 'category',
 'work_street_name': 'category',
 'work_street_number': 'category'}

We pass the dictionary we created to the **dtype** parameter of **pd.read_csv**. For the listed columns, pandas won't try to infer the type and will directly apply the specified dtype.

In [15]:
# note: infer_datetime_format is deprecated since pandas 2.0 and can be dropped there
df = pd.read_csv('../data/raw/sorted_passenger_tours_v2.csv', index_col=0, 
                 dtype=dtypes_df, parse_dates=['date_created'], infer_datetime_format=True)

print('Predefined formats data frame memory size: ' + str(mem_usage(df)))



Predefined formats data frame memory size: 3225.63 MB
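
The whole two-pass idea can be tried end to end without the real file. The sketch below is a toy version under stated assumptions: the CSV is generated in memory, its column names (`street`, `note`) are invented, and non-numeric columns are selected with `exclude=['number']`, a slight variation of the `include=['object']` filter used above.

```python
import io

import pandas as pd

# Toy CSV built in memory; the real file and its columns are not used here.
street_names = ["Steindamm", "Heiligengeistfeld"]
rows = [
    "{},2017-01-03 23:34:08,{},note_{}".format(i, street_names[i % 2], i)
    for i in range(1000)
]
csv_text = "id,date_created,street,note\n" + "\n".join(rows) + "\n"

# Pass 1: read only a sample and decide which columns become categories.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, nrows=100)
dtypes = {
    col: "category"
    for col in sample.select_dtypes(exclude=["number"]).columns
    if col != "date_created"
    and sample[col].nunique(dropna=False) < len(sample) / 2
}
print(dtypes)

# Pass 2: read everything with the dtypes fixed up front.
full = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype=dtypes,
    parse_dates=["date_created"],
)
print(full.dtypes)
```

One caveat of this approach is that the cardinality estimate comes only from the first rows, so a column that looks repetitive in the sample may turn out to have many more distinct values in the full file.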


# Things I know don't work, and other open questions:

- Unfortunately, pandas does not automatically treat the category dtype as factors, the way R does. I haven't found a solution for this other than converting the category columns back to strings, which again uses a lot of memory (pandas will do this when calling pd.get_dummies() anyway).
  
- If we have to do string manipulation, I don't know whether the category dtype impacts performance.  
  
- In the case of ordinal data, where the categories have an order, *e.g.* 'disagree', 'agree', 'strongly agree', we can instantiate the categorical dtype in a similar way, passing one extra parameter, *ordered=True* ([link here](https://pandas.pydata.org/pandas-docs/stable/categorical.html#controlling-behavior)).
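
For the ordered case in the last bullet, a minimal sketch could look like the following. It uses `pd.CategoricalDtype` with its `ordered` flag; the `answers` data is made up for illustration.

```python
import pandas as pd

# Categories listed from lowest to highest; ordered=True makes the order meaningful.
levels = ["disagree", "agree", "strongly agree"]
likert = pd.CategoricalDtype(categories=levels, ordered=True)

answers = pd.Series(["agree", "disagree", "strongly agree", "agree"]).astype(likert)

# With an ordered dtype, min/max, comparisons and sorting follow the category order,
# not the alphabetical order of the labels.
print(answers.min(), "/", answers.max())
print((answers >= "agree").tolist())
print(answers.sort_values().tolist())
```

Without `ordered=True`, the comparison `answers >= "agree"` raises a `TypeError`, because unordered categories have no notion of greater or smaller.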