## Imports

In [1]:
import numpy as np
import pandas as pd
import pickle

## Create fake data

In [2]:
n_obs = 100
fake_data = {'age': np.random.randint(25,100,n_obs), 
             'gender': np.random.choice(['female','male'], size=n_obs, replace=True),
             'm_status': np.random.choice(['single','married','widow'], size=n_obs, replace=True),
             'profession': np.random.choice(['accountant','lawyer','dentist','doctor','data scientist'], 
                                            size=n_obs, replace=True)}

In [3]:
df = pd.DataFrame(fake_data)
df.head(10)

| | age | gender | m_status | profession |
|---|---|---|---|---|
| 0 | 80 | male | single | data scientist |
| 1 | 37 | female | single | data scientist |
| 2 | 36 | male | single | lawyer |
| 3 | 90 | female | widow | doctor |
| 4 | 87 | female | widow | dentist |
| 5 | 48 | male | married | lawyer |
| 6 | 87 | female | widow | accountant |
| 7 | 75 | female | single | lawyer |
| 8 | 75 | female | married | dentist |
| 9 | 68 | male | widow | lawyer |


## Subset data

In [4]:
subset = df[(df.gender == 'female') & (df.age < 75) & (df.profession == 'data scientist')]
subset

| | age | gender | m_status | profession |
|---|---|---|---|---|
| 1 | 37 | female | single | data scientist |
| 20 | 55 | female | single | data scientist |
| 67 | 37 | female | widow | data scientist |
| 68 | 44 | female | single | data scientist |
| 82 | 72 | female | widow | data scientist |
| 83 | 40 | female | single | data scientist |
| 87 | 70 | female | widow | data scientist |


---

What if I had lots of data and didn't want to rerun the subset portion? 

Is there a way to save this subsetted dataframe and load it later when I need it?

**Yes, pickle!**

From the [docs](https://docs.python.org/3/library/pickle.html): 
> The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
>
> **Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.**
>
> The following types can be pickled:
>
> * None, True, and False
> * integers, floating point numbers, complex numbers
> * strings, bytes, bytearrays
> * tuples, lists, sets, and dictionaries containing only picklable objects
> * functions defined at the top level of a module (using def, not lambda)
> * built-in functions defined at the top level of a module
> * classes that are defined at the top level of a module
> * instances of such classes whose `__dict__` or the result of calling `__getstate__()` is picklable
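
To make the list above concrete, here is a minimal sketch that round-trips a dictionary of built-in types in memory and then shows a lambda being rejected. The `record` contents are made up for illustration, and the exact exception raised for the lambda may vary by Python version, so the sketch catches both likely types.

```python
import pickle

# Built-in containers holding only picklable values round-trip unchanged.
record = {
    "name": "Ada",
    "scores": [90, 85.5],
    "tags": {"a", "b"},
    "active": True,
    "note": None,
}
restored = pickle.loads(pickle.dumps(record))
print(restored == record)

# A lambda is excluded by the list above, so pickling it fails.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as err:
    lambda_pickled = False
    print(f"Could not pickle lambda: {type(err).__name__}")
```

`pickle.dumps` and `pickle.loads` work on bytes in memory; `pickle.dump` and `pickle.load` are the file-based counterparts.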

In [5]:
# set path for convenience
path = '/Users/davidziganto/Repositories/Data_Science_Fundamentals/pkl_files/'

In [6]:
# Save the subset to disk ('wb' = write in binary mode)
with open(path + 'subset_df.pkl', 'wb') as picklefile:
    pickle.dump(subset, picklefile)

In [7]:
# Remove the in-memory copy to show that the reload really comes from disk
del subset

In [8]:
# Load the pickled DataFrame back from disk ('rb' = read in binary mode)
with open(path + "subset_df.pkl", 'rb') as picklefile: 
    test = pickle.load(picklefile)
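
Loading is safe here because I wrote the file myself. The warning quoted from the docs matters for files from anyone else, and the harmless sketch below hints at why: an object can tell pickle which callable to run at load time. The `Payload` class is hypothetical and only evaluates a bit of arithmetic, but a hostile file could call something far worse.

```python
import pickle


class Payload:
    # Hypothetical class for illustration: __reduce__ tells pickle which
    # callable to invoke (and with which arguments) when the bytes are loaded.
    def __reduce__(self):
        return (eval, ("6 * 7",))


untrusted_bytes = pickle.dumps(Payload())

# Loading does not give back a Payload; it calls eval("6 * 7") instead.
result = pickle.loads(untrusted_bytes)
print(type(result).__name__, result)
```

The takeaway is that `pickle.load` runs code chosen by whoever produced the file, so it should only be pointed at pickles you trust.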

In [9]:
test

| | age | gender | m_status | profession |
|---|---|---|---|---|
| 1 | 37 | female | single | data scientist |
| 20 | 55 | female | single | data scientist |
| 67 | 37 | female | widow | data scientist |
| 68 | 44 | female | single | data scientist |
| 82 | 72 | female | widow | data scientist |
| 83 | 40 | female | single | data scientist |
| 87 | 70 | female | widow | data scientist |


Voilà. Now I can pick up where I left off without rerunning all that processing.
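
As an aside, pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which wrap the same save and load steps in one call each. The sketch below is an alternative to the `pickle.dump` approach used above, not what this notebook ran. It uses a small stand-in frame and a temporary directory (both assumptions for illustration) so that it runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame; the column names mirror the example above.
frame = pd.DataFrame(
    {
        "age": [37, 55],
        "gender": ["female", "female"],
        "profession": ["data scientist", "data scientist"],
    },
    index=[1, 20],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "subset_df.pkl")
    frame.to_pickle(pkl_path)            # save
    reloaded = pd.read_pickle(pkl_path)  # load

# The reloaded frame keeps the original values, dtypes, and index labels.
print(reloaded.equals(frame))
```

The same trust caveat applies: `pd.read_pickle` unpickles the file, so it should only be used on files from a trusted source.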