## Imports

In [1]:
import numpy as np
import pandas as pd
import pickle

## Create fake data

In [15]:
n_obs = 100
fake_data = {'age': np.random.randint(25,100,n_obs), 
             'gender': np.random.choice(['female','male'], size=n_obs, replace=True),
             'm_status': np.random.choice(['single','married','widow'], size=n_obs, replace=True),
             'profession': np.random.choice(['accountant','lawyer','dentist','doctor','data scientist'], 
                                            size=n_obs, replace=True)}

In [16]:
df = pd.DataFrame(fake_data)
df.head(10)

Unnamed: 0,age,gender,m_status,profession
0,30,female,single,data scientist
1,72,male,widow,data scientist
2,95,female,married,dentist
3,56,male,single,doctor
4,95,male,widow,lawyer
5,83,male,married,lawyer
6,55,female,widow,data scientist
7,71,female,widow,accountant
8,79,female,widow,lawyer
9,85,female,single,doctor


## Subset Data

In [19]:
subset = df[(df.gender == 'female') & (df.age < 75) & (df.profession == 'data scientist')]
subset

Unnamed: 0,age,gender,m_status,profession
0,30,female,single,data scientist
6,55,female,widow,data scientist
46,58,female,married,data scientist
67,31,female,married,data scientist
70,70,female,single,data scientist
81,70,female,married,data scientist
86,39,female,single,data scientist
92,30,female,married,data scientist
96,69,female,single,data scientist


---

What if I had lots of data and didn't want to rerun the subset portion?

Is there a way to save this subsetted dataframe and load it later when I need it?

**Yes, pickle!**

From the [docs](https://docs.python.org/3/library/pickle.html): 
>The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.    
>
>**Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.**
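
To make the "byte stream" idea concrete, here is a minimal sketch that pickles an object to bytes in memory and back again, with no file involved. The `record` dict is a made-up stand-in for any picklable object; `pickle.dumps` and `pickle.loads` are the in-memory counterparts of `pickle.dump` and `pickle.load`.

```python
import pickle

# Hypothetical record standing in for any picklable Python object
record = {'age': 30, 'gender': 'female', 'profession': 'data scientist'}

# Pickling: object hierarchy -> byte stream
blob = pickle.dumps(record)

# Unpickling: byte stream -> a new object equal to the original
restored = pickle.loads(blob)

print(type(blob).__name__)   # bytes
print(restored == record)    # True
print(restored is record)    # False
```

The restored object is an equal copy, not the original object itself, which is why it can be rebuilt in a fresh session.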

In [22]:
# set path for convenience (trailing slash so that path + filename forms a valid file path)
path = '/Users/davidziganto/Repositories/Data_Science_Fundamentals/pkl_files/'

In [24]:
# Save
with open(path + 'subset_df.pkl', 'wb') as picklefile:
    pickle.dump(subset, picklefile)
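
Joining a directory and a file name with `+` only gives the intended file when the directory string ends with a separator. As a rough sketch of an alternative, `pathlib` can insert the separator for me. The temporary directory and the `records` list below are assumptions made so the example runs anywhere; they are not part of this notebook's data.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the subset DataFrame; any picklable object works the same way
records = [{'age': 30, 'profession': 'data scientist'},
           {'age': 55, 'profession': 'data scientist'}]

# Temporary directory used here only so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / 'subset_df.pkl'   # the / operator adds the separator

    # Save
    with open(pkl_path, 'wb') as picklefile:
        pickle.dump(records, picklefile)

    # Open
    with open(pkl_path, 'rb') as picklefile:
        loaded = pickle.load(picklefile)

    saved_name = pkl_path.name

print(saved_name)         # subset_df.pkl
print(loaded == records)  # True
```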

In [25]:
del subset

In [27]:
# Open
with open(path + "subset_df.pkl", 'rb') as picklefile: 
    test = pickle.load(picklefile)

In [28]:
test

Unnamed: 0,age,gender,m_status,profession
0,30,female,single,data scientist
6,55,female,widow,data scientist
46,58,female,married,data scientist
67,31,female,married,data scientist
70,70,female,single,data scientist
81,70,female,married,data scientist
86,39,female,single,data scientist
92,30,female,married,data scientist
96,69,female,single,data scientist
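
pandas also has its own shortcuts for this, `DataFrame.to_pickle` and `pd.read_pickle`, which handle opening the file for me. Here is a minimal sketch, assuming a small made-up frame (`small_df`) and a temporary directory rather than the data and path used above. The same security warning applies: only load pickle files you trust.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame standing in for the subset
small_df = pd.DataFrame({'age': [30, 55],
                         'gender': ['female', 'female'],
                         'profession': ['data scientist', 'data scientist']},
                        index=[0, 6])

# Temporary directory used here only so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'small_df.pkl')
    small_df.to_pickle(pkl_path)          # save
    restored = pd.read_pickle(pkl_path)   # load

# Values, dtypes, and the original index labels all survive the round trip
print(restored.equals(small_df))  # True
```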


Voilà. Now I can pick up where I left off without having to run through all that processing again.