# 2. Title Generation - Preparing the Data

This is a quick side notebook that shows how we get our data.

---


I have a MongoDB database with data from a set of G06 US patent publications. On my system, 10,000 data samples take up around 16 MB (16,724,852 bytes), so it seems feasible to start with a dataset of 30,000 samples. I have used this to generate a pickle file with the data that may be downloaded from the 

In [1]:
from pymongo import MongoClient
client = MongoClient('mongodb', 27017)
db = client.patent_db

def data_generator(sample_size=1000):
    """Return a list of (first claim text, title) tuples for randomly sampled patents."""
    cursor = db.patents.aggregate(
                [{"$sample": {"size": sample_size}}],
                allowDiskUse=True
            )
    return [(d['claims'][0]['text'], d['title']) for d in cursor]
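
The list comprehension above indexes `d['claims'][0]` directly, so a document with an empty or missing `claims` field would raise an `IndexError` or `KeyError`. As a minimal sketch of a more defensive extraction step, the following runs on plain dictionaries so it can be tried without a database. The helper name `extract_pairs` and the `mock_docs` records are hypothetical, and the document shape is assumed from the fields accessed in `data_generator`.

```python
def extract_pairs(docs):
    """Yield (first claim text, title) pairs, skipping incomplete documents."""
    for doc in docs:
        claims = doc.get("claims") or []
        title = doc.get("title")
        if not claims or not title:
            # Skip documents with no claims or no title.
            continue
        text = claims[0].get("text")
        if text:
            yield text, title


# Hypothetical stand-in documents that mimic the fields used above.
mock_docs = [
    {"title": "Widget controller",
     "claims": [{"text": "1. A widget controller comprising..."}]},
    {"title": "No claims here", "claims": []},   # skipped: empty claims
    {"claims": [{"text": "1. A method..."}]},     # skipped: no title
]

pairs = list(extract_pairs(mock_docs))
print(pairs)
```

With a real cursor, `list(extract_pairs(cursor))` could stand in for the list comprehension, at the cost of possibly returning fewer than `sample_size` items.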

In [2]:
data = data_generator(sample_size=30000)

In [3]:
len(data)

30000

In [4]:
# from here - https://goshippo.com/blog/measure-real-size-any-python-object/

import sys

def get_size(obj, seen=None):
    """Recursively finds size of objects"""
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    # Important: mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([get_size(v, seen) for v in obj.values()])
        size += sum([get_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([get_size(i, seen) for i in obj])
    return size

In [5]:
get_size(data)

49460692

So 30,000 data items take up around 50 MB.
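
As a rough cross-check, the two figures reported in this notebook can be turned into a per-sample cost. This is only a sketch of the arithmetic: the variable names are illustrative, and the numbers are in-memory Python object sizes, which include per-object overhead and will likely differ from the size of the pickle file on disk.

```python
# Figures reported in this notebook.
bytes_10k = 16_724_852   # quoted size for 10,000 samples
bytes_30k = 49_460_692   # get_size(data) for 30,000 samples

# Average in-memory cost per (claim, title) sample.
per_sample_10k = bytes_10k / 10_000
per_sample_30k = bytes_30k / 30_000

# The same total expressed in decimal megabytes and binary mebibytes.
total_mb = bytes_30k / 1e6
total_mib = bytes_30k / 2**20

print(f"{per_sample_10k:.0f} bytes/sample (10k run)")
print(f"{per_sample_30k:.0f} bytes/sample (30k run)")
print(f"{total_mb:.1f} MB, {total_mib:.1f} MiB")
```

The two per-sample averages are close, which suggests the size scales roughly linearly with the number of samples.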

In [6]:
import pickle

PIK = "claim_and_title.data"

with open(PIK, "wb") as f:
    pickle.dump(data, f)
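
To read the data back in another notebook, the file can be opened in binary mode and passed to `pickle.load`. The sketch below is a self-contained round trip that uses a temporary directory and a one-item stand-in list (`sample` is hypothetical) rather than the real 30,000-item dataset. Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real list of (claim text, title) tuples.
sample = [("1. A method comprising...", "Example title")]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "claim_and_title.data")

    # Write the data, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(sample, f)

    # Read it back; the file must be opened in binary mode.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == sample)
```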