Title: 2. Title Generation - Preparing the Data
Tags: preparing_data
Authors: Ben Hoyle
Summary: This post looks at how we prepare the data for our title generation experiments.

# 2. Title Generation - Preparing the Data

In the previous post we looked at the problem for our current project: how to generate a patent title based on our claim text.

In this post we will look at the steps required to prepare some data for our machine learning algorithms. These steps roughly follow the guide [here](https://machinelearningmastery.com/how-to-prepare-data-for-machine-learning/):

- Step 1: Select Data
- Step 2: Preprocess Data
- Step 3: Transform Data

These steps can take up a fair proportion of the project time. The idea is to obtain a manageable data set and place it in a form where we can apply common machine learning libraries.

---
## 1. Select Data

I have a MongoDB database with data from a set of G06 US patent publications. On my system, 10,000 data samples take up around 16 MB (16,724,852 bytes, as measured below), so it seems feasible to start with a dataset of 30,000 samples. I have used this to generate a pickle file with the data that may be downloaded from the 

In [16]:
import pickle
import os

PIK = "claim_and_title.data"

if os.path.isfile(PIK):
    with open(PIK, "rb") as f:
        print("Loading data")
        data = pickle.load(f)
        print("{0} samples loaded".format(len(data)))
else:
    !wget https://benhoyle.github.io/notebooks/
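
For reference, a pickle file like this can be written with `pickle.dump`. The following is only a minimal sketch: the `samples` list and its `(claim text, title)` tuple layout are hypothetical stand-ins, not the actual structure of the data in `claim_and_title.data`, and a temporary directory is used so the real file is not touched.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real data: a list of (claim text, title) pairs
samples = [("A method comprising receiving data.", "Data receiving method")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "claim_and_title.data")

    # Write the samples in binary mode, as pickle requires
    with open(path, "wb") as f:
        pickle.dump(samples, f)

    # Read them back the same way the loading cell does
    with open(path, "rb") as f:
        restored = pickle.load(f)

print("{0} samples restored".format(len(restored)))
```

Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.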

In [17]:
data = data_generator(sample_size=10000)

In [19]:
len(data)

10000

In [21]:
# Source: https://goshippo.com/blog/measure-real-size-any-python-object/

import sys

def get_size(obj, seen=None):
    """Recursively finds size of objects"""
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    # Important: mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum(get_size(v, seen) for v in obj.values())
        size += sum(get_size(k, seen) for k in obj.keys())
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum(get_size(i, seen) for i in obj)
    return size
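
The recursion is needed because `sys.getsizeof` only reports the shallow size of a container: it counts the references a list holds, not the objects they point to. The short sketch below illustrates this with two hypothetical one-element lists (`small` and `large` are made-up names for this example); it assumes CPython, where sizes are implementation-specific.

```python
import sys

# Two one-element lists whose contents differ greatly in size
small = ["a" * 10]
large = ["a" * 10000]

# Shallow size: only the list object and its single reference are counted
shallow_small = sys.getsizeof(small)
shallow_large = sys.getsizeof(large)
print(shallow_small == shallow_large)

# The strings themselves differ, which the shallow size does not reveal
print(sys.getsizeof(small[0]) < sys.getsizeof(large[0]))
```

A recursive function that also visits the contents would report a much larger total for `large` than for `small`.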

In [22]:
get_size(data)

16724852

So 10,000 data items take up around 16 MB.
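
As a rough check on the 30,000-sample target, the measured figure can be scaled up. This sketch assumes that memory use grows linearly with the number of samples and that the extra samples are of similar length to the measured ones; the constant names are introduced here only for illustration.

```python
MEASURED_BYTES = 16724852   # get_size(data) for the 10,000 samples above
MEASURED_SAMPLES = 10000
TARGET_SAMPLES = 30000

# Average in-memory footprint of a single sample
bytes_per_sample = MEASURED_BYTES / MEASURED_SAMPLES

# Linear extrapolation to the target dataset size (assumption: similar samples)
estimated_bytes = MEASURED_BYTES * TARGET_SAMPLES // MEASURED_SAMPLES

print("{0:.0f} bytes per sample".format(bytes_per_sample))
print("{0:.1f} MB for {1} samples".format(estimated_bytes / 1e6, TARGET_SAMPLES))
```

Under these assumptions, 30,000 samples would need roughly 50 MB in memory.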