Pickling data for faster load times.

In [1]:
import pandas as pd
import pickle

In [2]:
%%time
may = pd.read_csv('../data/may.csv')
may.head()

Wall time: 19.4 s


Unnamed: 0,pubdatetime,latitude,longitude,sumdid,sumdtype,chargelevel,sumdgroup,costpermin,companyname
0,2019-05-01 00:01:41.247000,36.136822,-86.799877,PoweredLIRL1,Powered,93.0,scooter,0.0,Bird
1,2019-05-01 00:01:41.247000,36.191252,-86.772945,PoweredXWRWC,Powered,35.0,scooter,0.0,Bird
2,2019-05-01 00:01:41.247000,36.144752,-86.806293,PoweredMEJEH,Powered,90.0,scooter,0.0,Bird
3,2019-05-01 00:01:41.247000,36.162056,-86.774688,Powered1A7TC,Powered,88.0,scooter,0.0,Bird
4,2019-05-01 00:01:41.247000,36.150973,-86.783109,Powered2TYEF,Powered,98.0,scooter,0.0,Bird


Get info on the DataFrame, then look for ways to reduce its size. Remember: object columns (Python strings here) take up the most memory.

In [3]:
may.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20292503 entries, 0 to 20292502
Data columns (total 9 columns):
 #   Column       Dtype  
---  ------       -----  
 0   pubdatetime  object 
 1   latitude     float64
 2   longitude    float64
 3   sumdid       object 
 4   sumdtype     object 
 5   chargelevel  float64
 6   sumdgroup    object 
 7   costpermin   float64
 8   companyname  object 
dtypes: float64(4), object(5)
memory usage: 1.4+ GB
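
The `+` in `memory usage: 1.4+ GB` means pandas only counted the pointers held in the object columns, not the strings they point to. A minimal sketch of how to see the real per-column cost, using a tiny made-up sample in place of the full May data (the `sample` frame and its three rows are stand-ins, not the author's data):

```python
import pandas as pd

# Tiny stand-in for the scooter data; the real frame has about 20 million rows.
sample = pd.DataFrame({
    "latitude": [36.136822, 36.191252, 36.144752],
    # dtype=object mirrors what read_csv produced for string columns above.
    "sumdid": pd.Series(["PoweredLIRL1", "PoweredXWRWC", "PoweredMEJEH"], dtype=object),
    "companyname": pd.Series(["Bird", "Bird", "Bird"], dtype=object),
})

# deep=False counts only the 8-byte references stored in object columns.
shallow = sample.memory_usage(deep=False)
# deep=True also measures the Python string objects themselves.
deep = sample.memory_usage(deep=True)

print(pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep}))
```

On the real data, `may.info(memory_usage='deep')` reports the same deep figure in the summary line, at the cost of a slower call.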


Convert each unique company name to an integer using a dictionary, then update the 'companyname' column.

In [4]:
may.companyname.unique()

array(['Bird', 'Lyft', 'Gotcha', 'Lime', 'Spin', 'Jump', 'Bolt'],
      dtype=object)

In [5]:
company_dict = {'Bird':0, 'Lyft': 1, 'Gotcha': 2, 'Lime': 3, 'Spin': 4, 'Jump': 5, 'Bolt': 6}

In [6]:
may.companyname = may.companyname.replace(company_dict)
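
As a hedged aside, the dictionary approach above can also be written with `.map`, and pandas offers a `category` dtype that stores integer codes internally while still displaying the names. The sketch below runs on a small made-up Series (`names` is illustrative only) and also builds a reverse lookup, `reverse_dict`, which is an assumed helper name, for turning codes back into names:

```python
import pandas as pd

# Small illustrative sample; the real column has millions of rows.
names = pd.Series(["Bird", "Lyft", "Bird", "Bolt", "Lime"], dtype=object, name="companyname")

company_dict = {'Bird': 0, 'Lyft': 1, 'Gotcha': 2, 'Lime': 3, 'Spin': 4, 'Jump': 5, 'Bolt': 6}

# Option A: explicit integer codes. Unlike .replace, .map yields NaN for any
# name missing from the dictionary, which makes unexpected companies easy to spot.
codes = names.map(company_dict)

# Option B: let pandas manage the codes. The column still shows company names.
as_category = names.astype("category")

# Reverse lookup (hypothetical helper) to recover names from the integer codes.
reverse_dict = {code: name for name, code in company_dict.items()}
restored = codes.map(reverse_dict)

print(codes.tolist())
print(as_category.dtype)
print(restored.tolist())
```

Either option removes the object column; the dictionary keeps the codes stable across May, June, and July, which matters if the months are combined later.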

Convert 'pubdatetime' to a datetime.

In [7]:
may.pubdatetime = pd.to_datetime(may.pubdatetime)
may.head(3)

Unnamed: 0,pubdatetime,latitude,longitude,sumdid,sumdtype,chargelevel,sumdgroup,costpermin,companyname
0,2019-05-01 00:01:41.247,36.136822,-86.799877,PoweredLIRL1,Powered,93.0,scooter,0.0,0
1,2019-05-01 00:01:41.247,36.191252,-86.772945,PoweredXWRWC,Powered,35.0,scooter,0.0,0
2,2019-05-01 00:01:41.247,36.144752,-86.806293,PoweredMEJEH,Powered,90.0,scooter,0.0,0


Remove unwanted data (we don't need the bicycles). Note that scooters appear under two spellings, 'scooter' and 'Scooter'.

In [8]:
may.sumdgroup.unique()

array(['scooter', 'Scooter', 'bicycle'], dtype=object)

In [9]:
may_scooters = may.loc[may.sumdgroup.isin(['scooter', 'Scooter'])]
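
Because the group label appears in two capitalizations, an alternative is to normalize the case before comparing. This is a sketch on a made-up three-row frame (the `Standard0001` ID is invented for illustration), not a replacement for the cell above:

```python
import pandas as pd

sample = pd.DataFrame({
    "sumdid": ["PoweredLIRL1", "PoweredXWRWC", "Standard0001"],  # last ID is made up
    "sumdgroup": ["scooter", "Scooter", "bicycle"],
})

# Lower-case the labels so 'scooter' and 'Scooter' match the same test.
is_scooter = sample["sumdgroup"].str.lower() == "scooter"

# .copy() gives an independent frame, so later column assignments
# do not trigger SettingWithCopyWarning.
scooters = sample.loc[is_scooter].copy()

print(scooters)
```

The `isin` version is faster on 20 million rows because it skips the string conversion; the lower-casing version is more forgiving if another spelling turns up in a later month.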

Narrow down to just the columns we want to work with.

In [10]:
may_scooters = may_scooters[['pubdatetime', 'latitude', 'longitude', 'sumdid', 'chargelevel', 'companyname']]

Check .info() again to confirm the changes.

In [11]:
may_scooters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20283582 entries, 0 to 20292502
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   pubdatetime  datetime64[ns]
 1   latitude     float64       
 2   longitude    float64       
 3   sumdid       object        
 4   chargelevel  float64       
 5   companyname  int64         
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 1.1+ GB


Only one object column (sumdid) is left, and reported memory usage dropped from 1.4+ GB to 1.1+ GB, so we're good to go. Time to pickle.

In [14]:
may_scooters.to_pickle("../data/may.pkl")

Time the load to see how much faster it runs:

In [16]:
%%time
may_test = pd.read_pickle("../data/may.pkl")

Wall time: 1.1 s


Much better! From 19.4 s to 1.1 s, roughly 18 times faster.
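
Part of the speedup comes from the pickle storing the DataFrame with its dtypes intact, so nothing has to be parsed from text on load. A small sketch of that difference, written to a temporary directory with a two-row made-up frame rather than the real files:

```python
import tempfile
from pathlib import Path

import pandas as pd

sample = pd.DataFrame({
    "pubdatetime": pd.to_datetime(["2019-05-01 00:01:41.247", "2019-05-01 00:01:41.247"]),
    "latitude": [36.136822, 36.191252],
    "companyname": [0, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "sample.pkl"
    csv_path = Path(tmp) / "sample.csv"

    sample.to_pickle(pkl_path)
    sample.to_csv(csv_path, index=False)

    # The pickle comes back exactly as it was saved.
    from_pickle = pd.read_pickle(pkl_path)
    # The CSV comes back as text that must be re-parsed and re-converted.
    from_csv = pd.read_csv(csv_path)

print(from_pickle.dtypes)
print(from_csv.dtypes)
```

After the CSV round trip, `pubdatetime` is a string column again and would need another `pd.to_datetime` call, while the pickled version keeps its datetime dtype.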

Now we need to repeat the same pickling process for both June and July.
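
Since the steps are identical for every month, they could be collected into one function. The sketch below is an assumption about how that might look, not the notebook's own code: `prepare_scooters`, `COMPANY_DICT`, and `KEEP_COLUMNS` are hypothetical names, and the bicycle row in the sample is made up so the filter has something to drop.

```python
import pandas as pd

COMPANY_DICT = {'Bird': 0, 'Lyft': 1, 'Gotcha': 2, 'Lime': 3, 'Spin': 4, 'Jump': 5, 'Bolt': 6}
KEEP_COLUMNS = ['pubdatetime', 'latitude', 'longitude', 'sumdid', 'chargelevel', 'companyname']


def prepare_scooters(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the May cleaning steps to one month of raw data (hypothetical helper)."""
    # Filter rows and columns first so the conversions touch less data.
    mask = raw["sumdgroup"].isin(["scooter", "Scooter"])
    df = raw.loc[mask, KEEP_COLUMNS].copy()
    df["companyname"] = df["companyname"].map(COMPANY_DICT)
    df["pubdatetime"] = pd.to_datetime(df["pubdatetime"])
    return df


# Three-row stand-in for a raw monthly file; the bicycle row is invented.
raw = pd.DataFrame({
    "pubdatetime": ["2019-06-01 00:00:12", "2019-06-01 00:00:12", "2019-06-01 00:00:12"],
    "latitude": [36.1202, 36.163, 36.1622],
    "longitude": [-86.7534, -86.7765, -86.7806],
    "sumdid": ["PoweredA", "PoweredB", "StandardC"],
    "sumdtype": ["Powered", "Powered", "Standard"],
    "chargelevel": [90.0, 63.0, 0.0],
    "sumdgroup": ["scooter", "Scooter", "bicycle"],
    "costpermin": [0.06, 0.06, 0.06],
    "companyname": ["Jump", "Bird", "Jump"],
})

clean = prepare_scooters(raw)
print(clean)

# With the real files, the loop might look like this (left commented out here):
# for month in ["may", "june", "july"]:
#     prepare_scooters(pd.read_csv(f"../data/{month}.csv")).to_pickle(f"../data/{month}.pkl")
```

The cells that follow keep the steps written out for each month, which makes the intermediate outputs easier to inspect.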

JUNE PICKLING

In [17]:
%%time
june = pd.read_csv('../data/june.csv')
june.head()

Wall time: 28.6 s


Unnamed: 0,pubdatetime,latitude,longitude,sumdid,sumdtype,chargelevel,sumdgroup,costpermin,companyname
0,2019-06-01 00:00:12,36.1202,-86.7534,Powered93627c35-0f62-5b81-a78d-75a4a92ecf47,Powered,90.0,scooter,0.06,Jump
1,2019-06-01 00:00:12,36.163,-86.7765,Powered17715097-e8a0-5494-a5ab-9b625796607d,Powered,63.0,scooter,0.06,Jump
2,2019-06-01 00:00:12,36.1202,-86.7533,Powerede5cb95ae-b091-5a93-86fa-ededd946d0d7,Powered,77.0,scooter,0.06,Jump
3,2019-06-01 00:00:12,36.1201,-86.753,Powered71fa5e4f-1e17-54c4-936d-330df38cc2fa,Powered,0.0,scooter,0.06,Jump
4,2019-06-01 00:00:12,36.1622,-86.7806,Poweredfa549dd6-40bb-5757-ac87-2c2528f2bc68,Powered,2.0,scooter,0.06,Jump


In [18]:
june.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28046095 entries, 0 to 28046094
Data columns (total 9 columns):
 #   Column       Dtype  
---  ------       -----  
 0   pubdatetime  object 
 1   latitude     float64
 2   longitude    float64
 3   sumdid       object 
 4   sumdtype     object 
 5   chargelevel  float64
 6   sumdgroup    object 
 7   costpermin   float64
 8   companyname  object 
dtypes: float64(4), object(5)
memory usage: 1.9+ GB


In [19]:
june.companyname.unique()

array(['Jump', 'Bird', 'Bolt', 'Gotcha', 'Spin', 'Lime', 'Lyft'],
      dtype=object)

In [20]:
company_dict = {'Bird':0, 'Lyft': 1, 'Gotcha': 2, 'Lime': 3, 'Spin': 4, 'Jump': 5, 'Bolt': 6}

In [21]:
june.companyname = june.companyname.replace(company_dict)

In [22]:
june.pubdatetime = pd.to_datetime(june.pubdatetime)
june.head(2)

Unnamed: 0,pubdatetime,latitude,longitude,sumdid,sumdtype,chargelevel,sumdgroup,costpermin,companyname
0,2019-06-01 00:00:12,36.1202,-86.7534,Powered93627c35-0f62-5b81-a78d-75a4a92ecf47,Powered,90.0,scooter,0.06,5
1,2019-06-01 00:00:12,36.163,-86.7765,Powered17715097-e8a0-5494-a5ab-9b625796607d,Powered,63.0,scooter,0.06,5


In [23]:
june.sumdgroup.unique()

array(['scooter', 'Scooter', 'bicycle'], dtype=object)

In [24]:
june_scooters = june.loc[june.sumdgroup.isin(['scooter', 'Scooter'])]

In [25]:
june_scooters = june_scooters[['pubdatetime', 'latitude', 'longitude', 'sumdid', 'chargelevel', 'companyname']]

In [26]:
june_scooters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28037408 entries, 0 to 28046094
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   pubdatetime  datetime64[ns]
 1   latitude     float64       
 2   longitude    float64       
 3   sumdid       object        
 4   chargelevel  float64       
 5   companyname  int64         
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 1.5+ GB


In [27]:
june_scooters.to_pickle("../data/june.pkl")

In [28]:
%%time
june_test = pd.read_pickle("../data/june.pkl")

Wall time: 1.6 s


JULY PICKLING

In [29]:
%%time
july = pd.read_csv('../data/july.csv')
july.head()

Wall time: 24.9 s


Unnamed: 0,pubdatetime,latitude,longitude,sumdid,sumdtype,chargelevel,sumdgroup,costpermin,companyname
0,2019-07-01 00:00:33.550000,36.156678,-86.809004,Powered635135,Powered,22.0,scooter,0.15,Lyft
1,2019-07-01 00:00:34.973000,36.145674,-86.794138,Powered790946,Powered,33.0,scooter,0.15,Lyft
2,2019-07-01 00:00:41.183000,36.179319,-86.751538,Powered570380,Powered,76.0,scooter,0.15,Lyft
3,2019-07-01 00:00:41.620000,36.152111,-86.803821,Powered240631,Powered,43.0,scooter,0.15,Lyft
4,2019-07-01 00:00:45.087000,36.149355,-86.79755,Powered970404,Powered,52.0,scooter,0.15,Lyft


In [30]:
july.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25075445 entries, 0 to 25075444
Data columns (total 9 columns):
 #   Column       Dtype  
---  ------       -----  
 0   pubdatetime  object 
 1   latitude     float64
 2   longitude    float64
 3   sumdid       object 
 4   sumdtype     object 
 5   chargelevel  float64
 6   sumdgroup    object 
 7   costpermin   float64
 8   companyname  object 
dtypes: float64(4), object(5)
memory usage: 1.7+ GB


In [31]:
july.companyname.unique()

array(['Lyft', 'Bird', 'Spin', 'Bolt', 'Jump', 'Lime', 'Gotcha'],
      dtype=object)

In [32]:
company_dict = {'Bird':0, 'Lyft': 1, 'Gotcha': 2, 'Lime': 3, 'Spin': 4, 'Jump': 5, 'Bolt': 6}

In [33]:
july.companyname = july.companyname.replace(company_dict)

In [34]:
july.pubdatetime = pd.to_datetime(july.pubdatetime)
july.head(3)

Unnamed: 0,pubdatetime,latitude,longitude,sumdid,sumdtype,chargelevel,sumdgroup,costpermin,companyname
0,2019-07-01 00:00:33.550,36.156678,-86.809004,Powered635135,Powered,22.0,scooter,0.15,1
1,2019-07-01 00:00:34.973,36.145674,-86.794138,Powered790946,Powered,33.0,scooter,0.15,1
2,2019-07-01 00:00:41.183,36.179319,-86.751538,Powered570380,Powered,76.0,scooter,0.15,1


In [35]:
july.sumdgroup.unique()

array(['scooter', 'Scooter', 'bicycle'], dtype=object)

In [36]:
july_scooters = july.loc[july.sumdgroup.isin(['scooter', 'Scooter'])]

In [37]:
july_scooters = july_scooters[['pubdatetime', 'latitude', 'longitude', 'sumdid', 'chargelevel', 'companyname']]

In [38]:
july_scooters.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25066524 entries, 0 to 25075444
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   pubdatetime  datetime64[ns]
 1   latitude     float64       
 2   longitude    float64       
 3   sumdid       object        
 4   chargelevel  float64       
 5   companyname  int64         
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 1.3+ GB


In [39]:
july_scooters.to_pickle("../data/july.pkl")

In [40]:
%%time
july_test = pd.read_pickle("../data/july.pkl")

Wall time: 1.48 s


All three large data sets have now been pickled! June's load time dropped from 28.6 s to 1.6 s, and July's from 24.9 s to 1.48 s.