# Review Processing ETL
Example notebook for reading and transforming review data.  We read a source, remove sensitive personal information (SPI), and randomly split the data into fixed-size portions for training and testing.

In [1]:
import pandas as pd  # data read
from sklearn import preprocessing  # data ETL
from sklearn.model_selection import train_test_split   # stratified partitioning
import os, sys  # file checks
import pickle   # serialized results
import gzip  # compression 
import yaml   # configuration file
from sklearn.feature_extraction import text  # text processing
from datetime import datetime  # time processing
import ast # help with JSON parsing

## Configuration Options

It's handy to include configuration options in a standard file that can be quickly modified and rerun if you're training something new.  Of course, you can always use command-line configurations as well, but a handy set of defaults in a human-readable file might be a bit easier when you're running things in notebooks.

Here, we're using a simple [YAML](https://camel.readthedocs.io/en/latest/yamlref.html) file for our options which is human-readable, allows comments, and is well supported by other languages.

To modify this program's operation, just open the file `config.yaml` in your editor of choice and rerun this script.
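
As a rough sketch of what the parsed configuration might look like, the dictionary below mirrors the keys that the cells in this notebook look up (`path`, `partition`, `encoding`).  Only the key names come from the notebook; every value and the helper `missing_config_keys` are hypothetical and only there for illustration.

```python
# Hypothetical layout of the parsed config.yaml; only the key names are taken
# from the notebook, every value below is an illustrative placeholder.
example_config = {
    "path": {
        "raw": "data/reviews.json",
        "metadata": "data/metadata.json.gz",
        "example": "data/example.csv",
        "etl": "data/etl.pkl.gz",
        "model": "data/model.pkl",
    },
    "partition": {"test": 0.2},
    "encoding": {"max_text_terms": 500},
}

# (section, key) pairs that the cells in this notebook look up
REQUIRED_KEYS = [
    ("path", "raw"), ("path", "metadata"), ("path", "example"),
    ("path", "etl"), ("path", "model"),
    ("partition", "test"), ("encoding", "max_text_terms"),
]

def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the 'section.key' names that are absent from the config."""
    return ["{}.{}".format(section, key) for section, key in required
            if key not in config.get(section, {})]

print(missing_config_keys(example_config))           # nothing missing
print(missing_config_keys({"path": {"raw": "x"}}))   # lists the other six keys
```

A check like this right after loading can catch a misspelled key before a later cell fails halfway through a long run.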

In [10]:
config_path = 'config.yaml'
if not os.path.isfile(config_path):
    print("Sorry, can't find the configuration file {}, aborting.".format(config_path))
    sys.exit(-1)
with open(config_path) as f:  # close the file handle once parsing is done
    config = yaml.safe_load(f)

## Data Exploration
First, let's load our data to see if we need to perform any transform operations.  We will use built-in [JSON reading functions from pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) that will parse json files into rows and columns to return a standardized [pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). 

Of course, you could use whatever library or load function you're used to, but these dataframes have nice interoperability properties with other libraries for learning and manipulation.

In [3]:
if not os.path.isfile(config["path"]["raw"]):
    print("Sorry, can't find the raw input file {}, aborting.".format(config["path"]["raw"]))
    sys.exit(-1)

df_raw = pd.read_json(config["path"]["raw"], orient="records", lines=True)
df_raw.sample(5) # handy/pretty preview function within notebooks


Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
18784,B008FYGKJI,"[2, 3]",3,This is a basic multi-function printer.The set...,"01 10, 2013",AHS6PX6H22WW1,"H. Wang ""jwangamazon""",A basic multi-function printer with two-side p...,1357776000
13617,B004FQRW9W,"[4, 6]",4,I can verify what some other reviewers have no...,"01 24, 2012",A25UZ7MA72SMKM,Brent Butler,Powerful but may not stand the test of time,1327363200
20854,B00BEYXGNY,"[0, 0]",3,We use lots of colored paper for our art const...,"07 2, 2013",A1TS45JWJVOSSW,"Duane Sparks ""Duane""",Not the best for art projects,1372723200
22735,B00E5JUEF8,"[1, 1]",5,What else do you need? The pen performs as a ...,"12 30, 2013",A18LBGL7L9FEZV,Larry,Pen and stylus in one,1388361600
11939,B003KGBGO0,"[0, 0]",5,The entire back side of this mini white board ...,"04 3, 2014",A13BX9O5UDBILC,Jong Lee,strong magnet backing,1396483200


## Data Join
This is an okay set of data, but we can actually pull in some other metadata to add more product information.  As with real databases and data feeds, this may occasionally be necessary, but there should always be a "key" or  column that uniquely identifies rows.  Here, it's the column `asin` which is the inventory or product number.
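
Here's a minimal sketch of a key-based left join on toy data, assuming pandas is available.  The frames `reviews` and `meta` and all of their values are made up; the `validate` argument is an optional safety check and is not used in the notebook's own join.

```python
import pandas as pd

# toy stand-ins for the review and metadata tables (values are made up)
reviews = pd.DataFrame({
    "asin": ["A1", "A1", "B2", "C3"],
    "overall": [5, 3, 4, 1],
})
meta = pd.DataFrame({
    "asin": ["A1", "B2", "Z9"],
    "brand": ["Avery", "Scotch", "Smead"],
})

# how="left" keeps every review; validate raises MergeError if an asin
# appears more than once in the metadata, which would duplicate reviews
joined = reviews.merge(meta, on="asin", how="left", validate="many_to_one")

print(len(reviews), len(joined))        # row count is unchanged
print(joined["brand"].isna().sum())     # reviews with no metadata match
```

With a left join, reviews whose key has no metadata simply get missing values in the new columns, so the sample count should stay the same before and after.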

In [4]:
if not os.path.isfile(config["path"]["metadata"]):
    print("Sorry, can't find the metadata input file {}, aborting.".format(config["path"]["metadata"]))
    sys.exit(-1)

# a bit of a quirk, the metadata here uses single quotes (ugh!), which is not standard json
# so we must first load and transform that data; see the tip at https://stackoverflow.com/a/48593076
with gzip.open(config["path"]["metadata"], "rt") as f:
    df_meta = pd.DataFrame(ast.literal_eval("["+f.read().replace('\n', ',').replace('\r', '')+"]")).fillna('').astype(str)
    # also go from a string version of array to actual string array
    df_meta["categories"] = df_meta["categories"].apply(lambda x: ast.literal_eval(x.lower())[0])
    
df_meta.sample(5) # handy/pretty preview function within notebooks


Unnamed: 0,asin,brand,categories,description,imUrl,price,related,salesRank,title
72949,B004PXMSSK,,"[Office Products, Office & School Supplies, Pa...",The word expressed in these cards can be used ...,http://ecx.images-amazon.com/images/I/31iGdoPj...,,,{'Health & Personal Care': 454062},"Atsui Cards, Box Set of 6 Note Cards, Sympathy..."
48836,B002T45X20,,"[Office Products, Office & School Supplies, De...",Cooler Master Storm CS-M Weapon of Choice M4 D...,http://ecx.images-amazon.com/images/I/41izeWLp...,,,,Cooler Master Storm CS-M Weapon of Choice M4 D...
106545,B009P15RUS,Scotch&reg;,"[Office Products, Office & School Supplies, Ta...",Scotch Magic Greener Tape 812-24P contains 24 ...,http://ecx.images-amazon.com/images/I/414a9JiP...,25.0,,,"Scotch Magic Greener Tape, 3/4 x 900 Inches (8..."
6970,B000095S4P,Neenah,"[Office Products, Office & School Supplies, Pa...",,http://ecx.images-amazon.com/images/I/41xGcdpV...,12.93,"{'also_bought': ['B000J0B91U', 'B006X3PWV0', '...",,"Neenah Astrobrights Premium Color Card Stock, ..."
96555,B0083TQJFU,,"[Office Products, Office & School Supplies, Ca...",,http://ecx.images-amazon.com/images/I/61kTmdVC...,11.64,{'also_bought': ['1449415911']},,Thomas Kinkade Gardens of Grace with Scripture...
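
The single-quote quirk handled in the cell above can be seen on one line of input.  This is a small sketch with a made-up record: strict JSON parsing rejects it, while `ast.literal_eval` accepts it because it is a valid Python literal.

```python
import ast
import json

# one made-up metadata line in the single-quoted style described above
line = "{'asin': 'B000000000', 'categories': [['Office Products', 'Paper']]}"

try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False       # strict JSON requires double quotes

record = ast.literal_eval(line)  # evaluates Python literals only, never code
print(parsed_as_json, record["categories"][0])
```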


In [5]:
# now we join by ASIN to the raw data
print("Raw samples ({}), metadata records ({})".format(len(df_raw), len(df_meta)))
df_raw = df_raw.set_index("asin").join(df_meta.set_index("asin"), on="asin", how="left")
df_meta = None
print("Combined samples ({})".format(len(df_raw)))
df_raw.sample(5)

Raw samples (25374), metadata records (134838)
Combined samples (25374)


asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,brand,categories,description,imUrl,price,related,salesRank,title
B004UMNN2G,"[0, 0]",4,This is a good pack of hanging file folders. I...,"09 22, 2011",A2K89R0B20LYHB,Christine,Color useful but paper is thin,1316649600,Pendaflex,"[Office Products, Office & School Supplies, Fi...",New and Improved Ready Tab. Superior Durabilit...,http://ecx.images-amazon.com/images/I/41OHA5RM...,11.33,"{'also_bought': ['B00016UVP2', 'B000WXBDZG', '...",,"Pendaflex Ready-Tab Hanging File Folder, Assor..."
B0039MZZGK,"[0, 0]",5,"I'll be about the 500th reviewer to say ""cute""...","04 19, 2011",A37D2TGTIXRV2N,"plyopowerd ""Arrow Dynamic Mom""",Sweet accessory for shoe or Dorothy fans,1303171200,Scotch,"[Office Products, Office & School Supplies, Ta...","In 1930, 3M engineer Richard Drew invented an ...",http://ecx.images-amazon.com/images/I/41F3JOcZ...,6.99,"{'also_bought': ['B003VNE25M', 'B0086ZL1E0', '...",,"Red Shoe Scotch Magic Tape Dispenser, 3/4 x 35..."
B00AHV7MJO,"[0, 0]",5,"Oh, I loathe those little plastic tabs that go...","04 28, 2013",A1UQBFCERIP7VJ,Margaret Picky,Brilliant!,1367107200,Smead,"[Office Products, Office & School Supplies, Fi...",Erasable FasTab hanging folders have a special...,http://ecx.images-amazon.com/images/I/41vkqOGv...,15.69,"{'also_bought': ['B0013COEEW', 'B000GRA91W', '...",,"Smead Erasable FasTab Hanging Folders, 1/3-Cut..."
B007XPBW4I,"[1, 1]",5,"The last time I moved, the movers charged me t...","07 2, 2013",A5GPH59NDWJRB,Jenna of the Jungle,Sturdy,1372723200,Bankers Box,"[Office Products, Office & School Supplies, En...",Wardrobe boxes are designed specifically for m...,http://ecx.images-amazon.com/images/I/41w1E4jc...,65.7,"{'also_bought': ['B002A9JQSG', 'B007XPBW3Y', '...",,"Bankers Box SmoothMove Wardrobe Box, 24 x 24 x..."
B0006OF5MI,"[0, 0]",3,I had expected this sorter to be of the accord...,"04 21, 2014",ADY836HK6QSYR,"ardnam ""ardnam""",Not exactly what I thought,1398038400,Wilson Jones,"[Office Products, Office & School Supplies, Fi...",Book style expandable sorter easily organizes ...,http://ecx.images-amazon.com/images/I/411mWImP...,14.33,"{'also_bought': ['B002Q8HXZ4', 'B000J09O1W', '...",,"Wilson Jones Favorite Desk File/Sorter, A-Z In..."


In [6]:
print(df_raw.iloc[0]) # take a closer look at just one row/sample

helpful                                                      [0, 0]
overall                                                           5
reviewText        This is a really good, high-quality product.  ...
reviewTime                                              10 29, 2010
reviewerID                                           A1P2XYD265YE21
reviewerName                                    Andrea "Readaholic"
summary                                                 Really Good
unixReviewTime                                           1288310400
brand                                                         Avery
categories        [Office Products, Office & School Supplies, Pa...
description       Perfect for invitations, thank you notes, movi...
imUrl             http://ecx.images-amazon.com/images/I/51q11PWY...
price                                                          8.42
related           {'also_bought': ['B0000AQNVK', 'B00007E7CW', '...
salesRank                                       

## Partitions
To avoid data bias and contamination, let's break apart the data into `training` and `testing` (aka `evaluation`) sets. Holding out a test set lets us spot over-fitting and under-fitting, because the model is scored on samples it never saw during training.  Additionally, by doing the segmentation this early in model building, we'll be able to keep consistent samples for comparing different models that are evaluated.  In the function `train_test_split` we pass the column `overall` as the `stratify` argument so that each rating class appears in the same proportion in our training and testing data.

In some works, a third split often called `validation` can be used for parameter tuning after training but before testing on unseen data.
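
As a sketch of how a stratified three-way split could look, the snippet below calls `train_test_split` twice on a made-up frame.  The frame `toy`, its column values, and the 60/20/20 proportions are assumptions for illustration; the notebook itself only makes a single train/test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame with an imbalanced 'overall' rating (values are made up)
toy = pd.DataFrame({
    "overall": [5] * 60 + [4] * 30 + [1] * 10,
    "feature": range(100),
})

# first carve off 20% for testing, keeping the rating proportions
train_val, test = train_test_split(
    toy, test_size=0.2, stratify=toy["overall"], random_state=0)
# then split the remainder 75/25 into training and validation
train, val = train_test_split(
    train_val, test_size=0.25, stratify=train_val["overall"], random_state=0)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), part["overall"].value_counts().to_dict())
```

Note that stratifying preserves the original class proportions in every partition; it does not rebalance rare classes.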

Here are a few quick ETL steps we're performing...

1. A few fields, `reviewerID` and `reviewerName`, are likely SPI (sensitive personal information) that we don't need in any of our analysis, so we will simply delete them.

2. The field `reviewTime` is a redundant text form of `unixReviewTime`, so we can delete it, too.

3. We don't want our model to learn anything specific to unique products, so we will also drop the columns `brand`, `price`, `salesRank`, `imUrl`, `related`, `title`, and `asin`.


In [9]:
if "reviewerID" in df_raw.columns:
    df_raw.drop(["reviewerID", "reviewerName", "reviewTime", "brand", "price", "salesRank", "imUrl", "related", "title"], inplace=True, axis=1)
df_raw.reset_index(drop=True, inplace=True)
    
# partition into train/test sets (stratified on the rating), then compare to the prior data above
df = {}
df["X_train"], df["X_test"] = train_test_split(df_raw, stratify=df_raw["overall"],
                                        test_size=config["partition"]["test"], 
                                        random_state=0)
models = {}  # this will allow us to persist models
df["X_test_enc"] = df["X_test"].copy()
print(df["X_train"].iloc[0]) # take a closer look at just one row/sample (after dropping)
print("Dimensionality before processing {}".format(df["X_train"].shape))


helpful                                                      [1, 1]
overall                                                           4
reviewText        Oh my Avery, how I LOVE you!!  I am a total ne...
summary           Another wonderful Avery product that feeds my ...
unixReviewTime                                           1287619200
categories        [Office Products, Office & School Supplies, La...
description                                                        
Name: 665, dtype: object
Dimensionality before processing (20299, 7)


## Transforms
From the above explorations, we can see that we'll need to transform the data into a numerical representation before we can continue. 


### Textual Transforms
For text data, there are a million ways you can go from text to numerical features, from early, tried-and-true techniques like [frequency vectors](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and [hashing vectors](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer), where all words are counted and then assigned to a fixed feature dimension, to more complex methods like [word embedding](https://en.wikipedia.org/wiki/Word_embedding).

There are also tons of pre-processing steps that can be added like [stemming](https://pythonspot.com/nltk-stemming/) and [stop word removal](https://pythonspot.com/nltk-stop-words/), but those additional steps are left as exercises for the reader.

In this sample ETL code, we'll extract [normalized TF-IDF weights](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) with English stop words removed and a cap on the number of textual features, which is just one step above the simpler counts alone.  A richer demonstration with this library can be [found here](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py).
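
To make the idea concrete, here's a tiny sketch of the same vectorizer on a made-up corpus.  The documents and the small `max_features` value are assumptions chosen so the result is easy to inspect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up mini corpus standing in for the review text column
train_docs = [
    "great pen, writes smoothly",
    "the pen leaked and the ink smeared",
    "great paper for the printer",
]
test_docs = ["great printer ink"]

# same options as the notebook cell, with a tiny feature cap for display
vec = TfidfVectorizer(stop_words="english", max_features=5)
vec.fit(train_docs)                 # vocabulary and IDF come from training text only
X_test = vec.transform(test_docs)   # test text is only transformed

print(sorted(vec.vocabulary_))      # the retained terms
print(X_test.shape)                 # one row, at most max_features columns
```

Fitting on the training text only keeps word statistics from the test set out of the features, which is the same pattern the notebook follows.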

### Categorical Transforms
For categorical data, there are two popular methods to encode non-numerical data: [one hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and simply [label encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder).  

Label encoding is the simpler of the two and it enumerates all of your known values into numerical values.  It offers smaller final dimensionality and if the values are related, it might make sense to keep them ordinally related.  For example, mapping the quality grades "poor, acceptable, good" into a single ordinal.
```
    transform([['apple']; ['orange']; ['banana']]) --> [[0]; [1]; [2]]
```

One-hot encoding instead maps each known value to its own binary column.  Some classifiers learn dependencies between numerical values, so assigning arbitrary numbers to unrelated categories could actually hurt your case.
```
    transform([['apple']; ['orange']; ['banana']]) --> [[1 0 0]; [0 1 0]; [0 0 1]]
```
```

This data actually has the column `categories` which can contain multiple categories (e.g. `["chair","office"]`), so we'll run the transform there but use a [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) which works like one-hot encoding, but allows multiple items to be simultaneously active.
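
A quick sketch of the multi-label idea on made-up category lists (the label names here are placeholders, not values from the dataset):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# made-up category lists; each row may carry several labels at once
train_cats = [["office", "chair"], ["office", "paper"], ["paper"]]
test_cats = [["chair", "paper"]]

mlb = MultiLabelBinarizer()
mlb.fit(train_cats)                 # learn the label set from training rows

print(list(mlb.classes_))           # column order of the encoded matrix
print(mlb.transform(train_cats))    # one 0/1 column per known label
print(mlb.transform(test_cats))     # several columns can be active in one row
```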


In [11]:

# train text models for all three big text columns
for c in ["reviewText","summary","description"]:
    print("Preprocessing text column '{}'...".format(c))
    encN = "enc_text_{}".format(c)
    encC = "enc_cols_{}".format(c)
    # create and train vectorizer
    models[encN] = text.TfidfVectorizer(
        stop_words="english", max_features=config["encoding"]["max_text_terms"])
    models[encN].fit(df["X_train"][c])
    # encode new features to dataframe
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0+
    models[encC] = ["{}_{}".format(c[0:2], n) for n in models[encN].get_feature_names_out()]
    tmp_encode = pd.DataFrame(models[encN].transform(df["X_train"][c]).toarray(), 
                              index=df["X_train"].index, columns=models[encC])    
    df["X_train"] = pd.concat([df["X_train"], tmp_encode], axis=1, sort=False)
    # repeat for test data (except we didn't train with it)
    tmp_encode = pd.DataFrame(models[encN].transform(df["X_test_enc"][c]).toarray(), 
                              index=df["X_test_enc"].index, columns=models[encC])    
    df["X_test_enc"] = pd.concat([df["X_test_enc"], tmp_encode], axis=1, sort=False)
    # delete prior columns
    df["X_train"].drop(c, axis=1, inplace=True)
    df["X_test_enc"].drop(c, axis=1, inplace=True)
    print("... dimensionality after processing {}".format(df["X_train"].shape))

# now, normalize the list of helpfulness of the review scores 
#   (e.g. [1,4] which is "Yes/No") into a l1 unit vector of counts [0.2, 0.8] 
print("Preprocessing vectorized 'helpful' ...")
tmp_encode = preprocessing.normalize(list(df["X_train"]["helpful"]), norm='l1')
tmp_encode = pd.DataFrame(tmp_encode, columns=["help_0", "help_1"], index=df["X_train"].index)
df["X_train"] = pd.concat([df["X_train"], tmp_encode], axis=1, sort=False)
tmp_encode = preprocessing.normalize(list(df["X_test_enc"]["helpful"]), norm='l1')
tmp_encode = pd.DataFrame(tmp_encode, columns=["help_0", "help_1"], index=df["X_test_enc"].index)
df["X_test_enc"] = pd.concat([df["X_test_enc"], tmp_encode], axis=1, sort=False)
# delete prior columns
df["X_train"].drop("helpful", axis=1, inplace=True)
df["X_test_enc"].drop("helpful", axis=1, inplace=True)

# (the 'day of week' extraction from our timestamp is handled in the Time Transforms cell)


Preprocessing text column 'reviewText'...
... dimensionality after processing (20299, 506)
Preprocessing text column 'summary'...
... dimensionality after processing (20299, 1005)
Preprocessing text column 'description'...
... dimensionality after processing (20299, 1504)
Preprocessing vectorized 'helpful' ...
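
As a small check of the `helpful` normalization in the cell above, here's the same `preprocessing.normalize` call on a few made-up pairs, including the all-zero case that shows up often in this data:

```python
from sklearn import preprocessing

# made-up 'helpful' pairs, including the common all-zero case
pairs = [[1, 2], [1, 4], [0, 0]]
unit = preprocessing.normalize(pairs, norm="l1")
print(unit)   # each non-zero row now sums to 1; the zero row is left at zero
```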


In [25]:
# specify the columns of interest
print("Preprocessing one-hot category columns...")
models["col_hot"] = "categories"
# train and transform columns
models["enc_hot"] = preprocessing.MultiLabelBinarizer()
models["enc_hot"].fit(df["X_train"][models["col_hot"]])
# compute new column names after encoding
models["col_hot_enc"] = ["cat_{}".format(x) for x in models["enc_hot"].classes_]
# combine into new larger matrix
tmp_encode = pd.DataFrame(models["enc_hot"].transform(df["X_train"][models["col_hot"]]), 
                          index=df["X_train"].index, columns=models["col_hot_enc"])
df["X_train"] = pd.concat([df["X_train"], tmp_encode], axis=1, sort=False)
tmp_encode = pd.DataFrame(models["enc_hot"].transform(df["X_test_enc"][models["col_hot"]]), 
                          index=df["X_test_enc"].index, columns=models["col_hot_enc"])
df["X_test_enc"] = pd.concat([df["X_test_enc"], tmp_encode], axis=1, sort=False)
# delete prior columns for raw data
df["X_train"].drop(models["col_hot"], axis=1, inplace=True)
df["X_test_enc"].drop(models["col_hot"], axis=1, inplace=True)
print("... dimensionality after processing {}".format(df["X_train"].shape))

# another way to do category encoding with just two values (e.g. On/Off)
# # specify the columns of interest
# models["col_binary"] = ["Partner", "Dependents", "PhoneService", "InternetService", \
#                   "PaperlessBilling", "Churn"]
# # train and transform columns
# models["enc_binary"] = preprocessing.OrdinalEncoder()  # this is a multi-feature version of LabelEncoder
# models["enc_binary"].fit(df["X_train"][models["col_binary"]])
# # overwrite previous data with transformation
# df["X_train"][models["col_binary"]] = models["enc_binary"].transform(df["X_train"][models["col_binary"]])
# df["X_test_enc"][models["col_binary"]] = models["enc_binary"].transform(df["X_test_enc"][models["col_binary"]])



Preprocessing one-hot category columns...
... dimensionality after processing (20299, 1713)


### Time Transforms
One more tweak is to pull out the day of the week from our timestamp. Generally, you'd want to disregard all time information for fair learning, but some [studies demonstrate positive skews over time](https://hbr.org/2018/10/research-why-ratings-on-everything-from-wine-to-amazon-products-improve-over-time).  

Here, we're keeping both the raw timestamp (to account for above) and a day of week as an example of how to process and extract a time segment.
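
For a single value, the same extraction can be sketched with only the standard library.  The helper `day_of_week` is a hypothetical name; it uses the Monday=0 convention, which matches pandas' `dt.dayofweek`.

```python
from datetime import datetime, timezone

def day_of_week(unix_seconds):
    """Return 0 (Monday) through 6 (Sunday) for a UTC unix timestamp."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).weekday()

# 1357776000 is the sample row dated "01 10, 2013" in the first preview
ts = 1357776000
print(datetime.fromtimestamp(ts, tz=timezone.utc).date(), day_of_week(ts))
```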


In [26]:
print("Preprocessing vectorized 'unixReviewTime' ...")
df_time = pd.to_datetime(df["X_train"]["unixReviewTime"], unit='s')
df["X_train"]["reviewDOW"] = df_time.dt.dayofweek
df_time = pd.to_datetime(df["X_test_enc"]["unixReviewTime"], unit='s')
df["X_test_enc"]["reviewDOW"] = df_time.dt.dayofweek
print("... dimensionality after processing {}".format(df["X_train"].shape))


Preprocessing vectorized 'unixReviewTime' ...
... dimensionality after processing (20299, 1714)


### Excise your labels
Don't forget to exclude class labels from the training data or else you'll get 100% accuracy and probably a slap on the wrist from your model development friends!

In [27]:
# finally, let's grab our label column ("overall") that we're trying to predict
df["y_train"] = df["X_train"]["overall"]
del df["X_train"]["overall"]
df["y_test"] = df["X_test_enc"]["overall"]
del df["X_test"]["overall"]
del df["X_test_enc"]["overall"]

print("Final dimensionality after processing {}".format(df["X_train"].shape))


Final dimensionality after processing (20299, 1713)


In [28]:
# finally we should be able to cast all data to a float format;
# any values that cannot be parsed as numbers are set to zero
df["X_train"] = df["X_train"].apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
df["X_test_enc"] = df["X_test_enc"].apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)

# now compare to the prior data above
df["X_train"].head(10)

Unnamed: 0,unixReviewTime,re_10,re_100,re_11,re_12,re_20,re_30,re_34,re_3m,re_8217,...,cat_Utility Carts,cat_Velcro & Mounting Products,cat_View Binders,cat_VoIP,cat_Wall Calendars,cat_Wirebound Notebooks,cat_Wooden Colored Pencils,cat_Wrist Rests,cat_Writing & Correction Supplies,reviewDOW
665,1287619000.0,0.0,0.0,0.0,0.048788,0.047167,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
23034,1393200000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8253,1264032000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.558033,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
1963,1304554000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
22147,1379203000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
14291,1313539000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
11506,1292285000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3696,1292285000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7324,1287619000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
16380,1331770000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0


In [29]:
# finally, let's save all of this data to disk
df["X_train"].sample(100).to_csv(config["path"]["example"], index=False)

# write out our larger datasets as binary files
with gzip.open(config["path"]["etl"], 'wb') as f:
    pickle.dump(df, f)

# and write out our intermediate model data (in case we need to transform again)
with open(config["path"]["model"], 'wb') as f:
    pickle.dump(models, f)

# write out some stats
print("Created encoded training set {}x{} (labels {}) " \
      "and raw test set {}x{} (partition size {}) ".format( \
    len(df["X_train"]), len(df["X_train"].columns), len(df["y_train"]), 
    len(df["X_test"]), len(df["X_test"].columns), config["partition"]["test"]))

Created encoded training set 20299x1713 (labels 20299) and raw test set 5075x6 (partition size 0.2) 
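
To reuse the saved results later, the files can be read back the same way they were written.  Below is a minimal round-trip sketch with a made-up stand-in for the `models` dictionary and a hypothetical temporary path (the notebook reads its real paths from `config.yaml`).

```python
import gzip
import os
import pickle
import tempfile

# stand-in for the notebook's `models` dict (contents are made up)
models = {"col_hot": "categories", "enc_cols_summary": ["su_good", "su_great"]}

# hypothetical output location; the notebook reads its paths from config.yaml
path = os.path.join(tempfile.mkdtemp(), "model.pkl.gz")

with gzip.open(path, "wb") as f:    # gzip provides the compression
    pickle.dump(models, f)          # pickle only serializes the object

with gzip.open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == models)
```

Pickle files can execute code when loaded, so only unpickle files that you created or otherwise trust.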


## Normalization and scaling
There are other handy tricks to [normalize or scale](https://en.wikipedia.org/wiki/Normalization_(statistics)) your data, but we'll leave that as an exercise for the future.  Some learning algorithms (like deep neural networks) will derive features on their own so it isn't as necessary to preprocess the raw values in this way.
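
As one possible starting point for that exercise, here's a sketch using scikit-learn's `StandardScaler` on a made-up single column.  The values are placeholders, and this is just one of several scaling options rather than something this notebook applies.

```python
from sklearn.preprocessing import StandardScaler

# made-up single-feature columns, e.g. raw unix timestamps
X_train = [[1287619200.0], [1327363200.0], [1396483200.0]]
X_test = [[1357776000.0]]

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # mean and variance from training rows
X_test_std = scaler.transform(X_test)         # reuse the training statistics

print(X_train_std.ravel())
print(X_test_std.ravel())
```

As with the text and category encoders, the scaler is fit on the training partition only and then applied unchanged to the test partition.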