## Imports

In [11]:
import numpy as np
import pandas as pd
import pickle

## Create fake data

In [12]:
np.random.seed(24)
n_obs = 100
fake_data = {'age': np.random.randint(25,100,n_obs), 
             'gender': np.random.choice(['female','male'], size=n_obs, replace=True),
             'm_status': np.random.choice(['single','married','widow'], size=n_obs, replace=True),
             'profession': np.random.choice(['accountant','lawyer','dentist','doctor','data scientist'], 
                                            size=n_obs, replace=True)}

In [13]:
df = pd.DataFrame(fake_data)
df.head(10)

   age  gender m_status      profession
0   59    male  married  data scientist
1   28    male  married  data scientist
2   89  female    widow          lawyer
3   42  female    widow         dentist
4   42  female  married         dentist
5   26  female    widow  data scientist
6   29    male    widow          doctor
7   36    male    widow  data scientist
8   40  female  married      accountant
9   98  female  married         dentist


## Subset Data

In [14]:
subset = df[(df.gender == 'female') & (df.age < 75) & (df.profession == 'data scientist')]
subset

    age  gender m_status      profession
5    26  female    widow  data scientist
11   32  female    widow  data scientist
12   50  female  married  data scientist
14   53  female    widow  data scientist
30   58  female   single  data scientist
42   25  female  married  data scientist
81   63  female  married  data scientist
92   47  female    widow  data scientist


---

What if I had lots of data and didn't want to rerun the subset portion? 

Is there a way to save this subsetted dataframe and load it later when I need it?

**Yes, pickle!**

From the [docs](https://docs.python.org/3/library/pickle.html): 
>The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.    
>
>**Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.**
>
>The following types can be pickled:
>
>* None, True, and False
>* integers, floating point numbers, complex numbers
>* strings, bytes, bytearrays
>* tuples, lists, sets, and dictionaries containing only picklable objects
>* functions defined at the top level of a module (using def, not lambda)
>* built-in functions defined at the top level of a module
>* classes that are defined at the top level of a module
>* instances of such classes whose `__dict__` or the result of calling `__getstate__()` is picklable
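
To make that list concrete, here is a minimal sketch (not from the docs) that round-trips a few picklable objects in memory and shows a lambda being rejected. The names `record`, `restored`, `restored_fn`, and `lambda_ok` are made up for illustration.

```python
import pickle

# A dictionary containing only picklable objects (strings, numbers, None, a list).
record = {"name": "example", "values": [1, 2.5, 3 + 4j], "flag": None}

# dumps() serializes to bytes in memory; loads() rebuilds an equal object.
restored = pickle.loads(pickle.dumps(record))

# A built-in function defined at the top level of a module is pickled by reference.
restored_fn = pickle.loads(pickle.dumps(len))

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: 2 * x)
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False

print(restored == record, restored_fn is len, lambda_ok)
```

The same `dumps`/`loads` pair is handy for quickly checking whether an object can be pickled before writing it to disk.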

In [15]:
# set path for convenience
path = '/Users/davidziganto/Repositories/Data_Science_Fundamentals/pkl_files/'

In [16]:
# Save
with open(path + 'subset_df.pkl', 'wb') as picklefile:
    pickle.dump(subset, picklefile)

In [17]:
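# Show that `test` doesn't exist yet. This only holds in a fresh kernel; the
# output below is test's contents because it was still defined from an earlier run.
try:
    print(test)
except NameError:
    print('test does not exist!')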

    age  gender m_status      profession
5    26  female    widow  data scientist
11   32  female    widow  data scientist
12   50  female  married  data scientist
14   53  female    widow  data scientist
30   58  female   single  data scientist
42   25  female  married  data scientist
81   63  female  married  data scientist
92   47  female    widow  data scientist


In [18]:
# Open
with open(path + "subset_df.pkl", 'rb') as picklefile: 
    test = pickle.load(picklefile)

In [19]:
test

    age  gender m_status      profession
5    26  female    widow  data scientist
11   32  female    widow  data scientist
12   50  female  married  data scientist
14   53  female    widow  data scientist
30   58  female   single  data scientist
42   25  female  married  data scientist
81   63  female  married  data scientist
92   47  female    widow  data scientist


Voilà. Now I can pick up where I left off without rerunning all that processing.
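
If I want more than an eyeball comparison, a quick sanity check is to confirm the loaded object equals the original. The sketch below is self-contained, so it uses a tiny stand-in DataFrame and a temporary directory rather than the notebook's `subset` and `path`; the names `small`, `loaded`, and `round_trip_ok` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the subsetted DataFrame, keeping a non-default index.
small = pd.DataFrame(
    {"age": [26, 32, 50], "profession": ["data scientist"] * 3},
    index=[5, 11, 12],
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "small_df.pkl"

    # Save
    with open(pkl_path, "wb") as picklefile:
        pickle.dump(small, picklefile)

    # Open
    with open(pkl_path, "rb") as picklefile:
        loaded = pickle.load(picklefile)

# DataFrame.equals compares shape, values, index, and dtypes.
round_trip_ok = loaded.equals(small)
print(round_trip_ok)
```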

## Better Way (w/DF)

In [20]:
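# Save the subset (not the full df); pass compression='gzip' again when reading, since '.pkl' isn't inferred as gzip
subset.to_pickle(path + 'subset_df2.pkl', compression='gzip')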
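
To load a file written this way, pandas has a matching `pd.read_pickle`. Here is a minimal sketch, assuming a throwaway DataFrame and a temporary directory (the names `frame`, `reloaded`, and `reloaded_inferred` are made up). Because a `.pkl` extension does not tell pandas the file is gzipped, `compression='gzip'` has to be passed again on read. Ending the file name in `.gz` lets the default `compression='infer'` handle both directions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Throwaway DataFrame standing in for the notebook's data.
frame = pd.DataFrame({"age": [26, 32, 50], "gender": ["female"] * 3})

with tempfile.TemporaryDirectory() as tmp:
    # Explicit gzip with a plain .pkl name: state the compression on both ends.
    pkl_path = Path(tmp) / "frame.pkl"
    frame.to_pickle(pkl_path, compression="gzip")
    reloaded = pd.read_pickle(pkl_path, compression="gzip")

    # With a .gz suffix, the default compression='infer' picks gzip automatically.
    gz_path = Path(tmp) / "frame.pkl.gz"
    frame.to_pickle(gz_path)
    reloaded_inferred = pd.read_pickle(gz_path)

print(reloaded.equals(frame), reloaded_inferred.equals(frame))
```

The same security warning applies here: `pd.read_pickle` unpickles the file, so only use it on files from a trusted source.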