In [2]:
import pandas as pd
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Introduction to Pickling

---

## Learning objectives

- Learn what serialization and deserialization are
- Learn what "pickling" is in Python
- Review using `with` statements to safely handle file operations
- Pickle and unpickle sklearn models in Python

---

## What is pickling?

If you're talking about food, pickling is a method of preserving food for the future. If you're talking about Python, pickling is a method of preserving **objects** for the future, including functions and classes. Since sklearn models are instances of classes, that means they can be pickled.

To pickle an object, Python **serializes** it: the object is transformed into a byte stream. (A byte stream is a sequence of bytes. One byte is made up of eight bits, each a zero or a one.) To unpickle an object so that it can be used in Python again, the byte stream is **deserialized** back into an object.
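To see the round trip without touching the disk, here is a minimal sketch using `pickle.dumps` and `pickle.loads`, which work on in-memory bytes. The `recipe` dictionary is a made-up example object, not part of this lesson's data.

```python
import pickle

# Hypothetical example object, used only for illustration.
recipe = {"vegetable": "cucumber", "brine_days": 14}

serialized = pickle.dumps(recipe)    # serialize: object -> bytes
restored = pickle.loads(serialized)  # deserialize: bytes -> object

print(type(serialized))    # <class 'bytes'>
print(restored == recipe)  # True: same contents
print(restored is recipe)  # False: a new object was built from the bytes
```

The restored dictionary is equal to the original but is a separate object, which is what we would expect from loading a save file.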

If you've ever saved your progress in a video game, you've already serialized data without knowing it! A save file is your serialized save state. When you load the save, you deserialize the data so you can resume the game right where you were before you quit.

### Warning:

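Just like you can't open a [Pokemon: Red](https://en.wikipedia.org/wiki/Pok%C3%A9mon_Red_and_Blue) savefile in [Pokemon: Sun](https://en.wikipedia.org/wiki/Pok%C3%A9mon_Sun_and_Moon), a pickle is not guaranteed to load in a different environment from the one that created it. Newer versions of Python can read pickles written with older pickle protocols, but an older Python cannot read a newer protocol, and sklearn models should be unpickled with the same version of sklearn that pickled them.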

Pickle files can contain malicious code, which runs as soon as the file is unpickled. Never unpickle an object from a source you don't trust!
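To make the risk concrete, here is a harmless sketch of how unpickling can run code. The `Sneaky` class is hypothetical and only calls `print`, but the `__reduce__` hook it uses can name any callable, which is why untrusted pickles are dangerous.

```python
import pickle


class Sneaky:
    # __reduce__ tells pickle how to rebuild the object:
    # "call this callable with these arguments" at load time.
    def __reduce__(self):
        return (print, ("This ran just because you unpickled me!",))


payload = pickle.dumps(Sneaky())

# Loading the bytes calls print(...) immediately. The "object" we get back
# is simply whatever the callable returned (None, in the case of print).
result = pickle.loads(payload)
```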

## Why pickle?

Pickling makes a lot of sense any time you have a model you want to work with that you don't want to refit.

If you have a model that took twelve hours to fit, you might want to analyze its residuals, work with its coefficients, or make predictions with it. But without pickling, you'd need to refit the model every time you restarted your notebook. Pickling the model allows you to load the fitted model _without_ needing to re-run the code where you fit it.

Note: pickling does **not** compress your model, meaning that some pickled models can end up as fairly large files. (Think of K-nearest neighbors, which requires **every** training data point to be stored inside the model.)
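If file size becomes a problem, one option is to compress the pickled bytes yourself. The sketch below is an illustration, not part of this lesson's workflow: it uses the standard library's `gzip` module, and `stored_points` is made-up data standing in for the training points a model might keep.

```python
import gzip
import pickle

# Made-up data standing in for points stored inside a fitted model.
stored_points = [[float(i), float(i) * 2.0] for i in range(50_000)]

raw = pickle.dumps(stored_points)  # plain pickle: no compression
compressed = gzip.compress(raw)    # compress the pickled bytes

print(len(raw), len(compressed))   # sizes in bytes

# To load, decompress first, then unpickle.
restored = pickle.loads(gzip.decompress(compressed))
```

How much space this saves depends heavily on the data, so treat compression as something to measure rather than assume.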

---

## Pickling a simple datatype

Before we pickle a full model, let's demonstrate pickling on a simple list.

Create a list called `things_to_pickle` that contains some strings:

In [3]:
things_to_pickle = ["cucumbers", "pigs' feet", "beets", "a peck of peppers"]

### Write the pickled list to disk

Let's review [this link](https://www.pythonforbeginners.com/files/with-statement-in-python) to go over why `with` is such a good tool for file operations.
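In short, `with` closes the file for you when the block ends, even if an error is raised inside it. Here is a minimal sketch of that behavior. It writes to a temporary directory, and the file name `example.txt` is just a placeholder for illustration.

```python
import os
import tempfile

# Work in a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # placeholder file name

    with open(path, "w") as f:
        f.write("brine")
        closed_inside = f.closed  # the file is still open inside the block

    closed_after = f.closed       # closed automatically when the block ends

    # The file is closed even when the block raises an exception.
    try:
        with open(path, "r") as g:
            raise ValueError("something went wrong mid-read")
    except ValueError:
        closed_after_error = g.closed

print(closed_inside, closed_after, closed_after_error)  # False True True
```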

Let's use `with` to write the list to disk as a `.pkl` file. We'll need to use `open`, pass in a file name, and pass the mode `'wb'` to tell Python we're **writing** to the file, and writing as **bytes**. The pickle function we'll use is called `dump`.

In [4]:
with open('data/things_to_pickle.pkl', 'wb') as pickle_out:
    pickle.dump(things_to_pickle, pickle_out)

### Open the pickled list

Let's use `with` to open the pickled file and load its contents into a new variable, `list_from_pickle`. Remember to pass the mode `'rb'` to tell Python that we're **reading** from the file, and that we're reading in **bytes**. The pickle function we'll use is called `load`.

In [None]:
with open('data/things_to_pickle.pkl', 'rb') as pickle_in:
    list_from_pickle = pickle.load(pickle_in)

So far, so good!

---

## Pickle a fitted pipeline

Here, we'll fit a pipeline to the Trump-Clinton corpus, then pickle the model.

Then, we'll load the pickle in a new notebook to demonstrate that the fitted model has been saved. Some code has been provided for you.

### Import the data

Here, the data is imported, and some elementary cleaning is performed:

In [5]:
df = pd.read_csv('data/trump_clinton_tweets.csv')
df = df[df['is_retweet'] == False][['text', 'handle']]
df.head(3)

| | text | handle |
|---|---|---|
| 0 | The question in this election: Who can put the... | HillaryClinton |
| 3 | If we stand together, there's nothing we can't... | HillaryClinton |
| 4 | Both candidates were asked about how they'd co... | HillaryClinton |


### Set up train and test

In [6]:
X = df['text']
y = df['handle'].map(lambda x: 1 if x == 'realDonaldTrump' else 0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

### Instantiate, fit, and score a pipeline

In [7]:
pipe = Pipeline([
    ('cv', CountVectorizer(min_df=3)),
    ('lr', LogisticRegressionCV(cv=3, max_iter=1000))
])

pipe.fit(X_train, y_train)
pipe.score(X_train, y_train), pipe.score(X_test, y_test)

(0.9937077604288045, 0.9217330538085255)

### Export the fitted pipeline to `data` as `tweet_pipeline.pkl`:

Just like before, we'll use a `with` statement:

In [9]:
with open('data/tweet_pipeline.pkl', 'wb') as pickle_out:
    pickle.dump(pipe, pickle_out)

Now, open the notebook called `read_a_pickle`.
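Loading a pickled pipeline follows the same pattern as loading the pickled list. The sketch below is a self-contained illustration under stated assumptions, not the contents of that notebook: it fits a toy pipeline on a tiny made-up corpus, swaps in plain `LogisticRegression` because the data is too small for cross-validation, and writes to a temporary directory instead of `data/`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for the tweets (not the real data).
texts = ["pickles are great", "I love pickles", "great crunchy pickles",
         "beets are awful", "I hate beets", "awful mushy beets"]
labels = [1, 1, 1, 0, 0, 0]

toy_pipe = Pipeline([
    ("cv", CountVectorizer()),
    ("lr", LogisticRegression()),  # assumption: simpler stand-in for LogisticRegressionCV
])
toy_pipe.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")  # hypothetical file name

    with open(path, "wb") as pickle_out:
        pickle.dump(toy_pipe, pickle_out)

    with open(path, "rb") as pickle_in:
        loaded_pipe = pickle.load(pickle_in)

# The loaded pipeline is already fitted: no call to .fit() is needed.
same_predictions = (loaded_pipe.predict(texts) == toy_pipe.predict(texts)).all()
print(same_predictions)
```

Note that the loading side still needs sklearn installed, since pickle stores the fitted state and relies on the library's classes to rebuild the object.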