# Feature Transformation with Scikit-Learn In This Notebook
## Saving Features into the SageMaker Feature Store

In this notebook, we convert raw text into BERT embeddings.  This will allow us to perform natural language processing tasks such as text classification. We save the features into the SageMaker Feature Store.


![](img/prepare_dataset_bert.png)

# BERT Mania!

![BERT Mania](img/bert_mania.png)

# Understand BERT Embeddings

* Bidirectional Encoder Representations from Transformers [BERT](https://arxiv.org/abs/1810.04805)
* For more details on Transformers Architecture, see [Attention Is All You Need](https://arxiv.org/abs/1706.03762).

<img src="img/bert_embeddings.png" width="60%" align="left">

<img src="img/bert_input_features.png" width="80%" align="left">

* **input_ids**: 
The id from the pre-trained BERT vocabulary that represents the token. (Padding of 0 will be used if the # of tokens is less than max_seq_length)

* **input_mask**: 
Specifies which tokens BERT should pay attention to (0 or 1). Padded input_ids will have 0 in each of these vector elements.

* **segment_ids**: 
Segment ids are always 0 for single-sequence tasks such as text classification. 1 is used for two-sequence tasks such as question/answer and next sentence prediction.
  
* **label_id**: 
Label for each training row (star_rating 1 through 5)
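
As a minimal sketch of how these four fields relate, the snippet below builds them by hand for one short review. The vocabulary `TOY_VOCAB` and the helper `build_features` are hypothetical stand-ins for illustration only; a real BERT tokenizer uses a much larger WordPiece vocabulary with different ids. The `star_rating - 1` mapping mirrors the `label_map` defined later in this notebook.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary (an assumption for illustration, not BERT's real ids).
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0,
    "[UNK]": 1,
    "[CLS]": 2,
    "[SEP]": 3,
    "great": 4,
    "product": 5,
}


def build_features(
    tokens: List[str], star_rating: int, max_seq_length: int
) -> Dict[str, Union[List[int], int]]:
    """Build input_ids, input_mask, segment_ids, and label_id for one review."""
    # Keep room for the [CLS] and [SEP] markers, truncating long inputs.
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    # Real tokens get attention (1); padding added below does not (0).
    input_mask = [1] * len(input_ids)

    padding = max_seq_length - len(input_ids)
    input_ids += [TOY_VOCAB["[PAD]"]] * padding
    input_mask += [0] * padding

    # Single-sequence task: every position belongs to segment 0.
    segment_ids = [0] * max_seq_length
    # star_rating 1..5 maps to label_id 0..4.
    label_id = star_rating - 1

    return {
        "input_ids": input_ids,
        "input_mask": input_mask,
        "segment_ids": segment_ids,
        "label_id": label_id,
    }


features = build_features(["great", "product"], star_rating=5, max_seq_length=8)
print(features)
```

Note that `input_ids` and `input_mask` always have the same length as `max_seq_length`, which is what lets the rows be batched together.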

In [11]:
import sagemaker
import boto3

sess = sagemaker.Session()
bucket = sess.default_bucket()
region = boto3.Session().region_name

import botocore.config

config = botocore.config.Config(
    user_agent_extra='dsoaws/1.0'
)

In [12]:
%store -r role

# Define Maximum Sequence Length for BERT
The maximum sequence length is chosen based on the distribution of word counts in the review text.
![](img/max_seq_length_viz.png)
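
One way to derive such a value from data is sketched below. This is an assumption about how the choice could be made, not necessarily how the value used in this notebook was obtained: the helper `choose_max_seq_length`, the 95th-percentile threshold, the power-of-two rounding, and the cap of 512 are all illustrative. Whitespace word counts also underestimate the number of WordPiece tokens, so treat the result as a lower bound.

```python
import math
from typing import Iterable


def choose_max_seq_length(
    texts: Iterable[str], percentile: float = 0.95, cap: int = 512
) -> int:
    """Pick a sequence length that covers `percentile` of the reviews."""
    counts = sorted(len(text.split()) for text in texts)
    if not counts:
        raise ValueError("texts must not be empty")

    # Index of the word count at the requested percentile.
    index = min(len(counts) - 1, max(0, math.ceil(percentile * len(counts)) - 1))
    # Add two positions for the [CLS] and [SEP] markers.
    needed = counts[index] + 2

    # Round up to the next power of two, then apply the cap.
    length = 1
    while length < needed:
        length *= 2
    return min(length, cap)


sample_reviews = ["works great", "did not install on my laptop at all", "ok"]
print(choose_max_seq_length(sample_reviews))
```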

In [16]:
max_seq_length = 64

In [4]:
%pip install datasets transformers torch torchdata torcharrow

Keyring is skipped due to an exception: 'keyring.backends'
Collecting datasets
  Using cached datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting transformers
  Using cached transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting torch
Note: you may need to restart the kernel to use updated packages.


In [None]:
# from datasets import load_dataset_builder

# data_files = {
#               "train": ["./data-bloom/train/part-algo-1-womens_clothing_ecommerce_reviews.csv"],
#               "validation": ["./data-bloom/validation/part-algo-1-womens_clothing_ecommerce_reviews.csv"],
#               "test": ["./data-bloom/test/part-algo-1-womens_clothing_ecommerce_reviews.csv"]
#              }

# output_dir = "s3://dsoaws/bloom/data/"
# builder = load_dataset_builder("csv", data_files=data_files)
# builder.download_and_prepare(output_dir, file_format="parquet")

In [6]:
# from datasets import load_dataset
    

# dataset_train = load_dataset("amazon_us_reviews", "Digital_Video_Games_v1_00") #, data_files={"train": "Apparel_v1_00"})  #, "validation": path_to_validation.txt}

# dataset_validation = load_dataset("amazon_us_reviews", "Digital_Software_v1_00") #, data_files={"train": "Apparel_v1_00"})  #, "validation": path_to_validation.txt}

# dataset_train

Found cached dataset amazon_us_reviews (/root/.cache/huggingface/datasets/amazon_us_reviews/Digital_Video_Games_v1_00/0.1.0/17b2481be59723469538adeb8fd0a68b0ba363bbbdd71090e72c325ee6c7e563)


  0%|          | 0/1 [00:00<?, ?it/s]

Found cached dataset amazon_us_reviews (/root/.cache/huggingface/datasets/amazon_us_reviews/Digital_Software_v1_00/0.1.0/17b2481be59723469538adeb8fd0a68b0ba363bbbdd71090e72c325ee6c7e563)


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'product_category', 'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 'review_headline', 'review_body', 'review_date'],
        num_rows: 145431
    })
})

In [None]:
# output_dir = "s3://dsoaws/amazon_us_reviews"
# builder = load_dataset_builder("amazon_us_reviews")
# builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")

In [None]:
# from datasets import load_dataset, load_from_disk, load_dataset_builder

# #data_files = {"train": "s3://dosaws/data/train/part-algo-1-womens_clothing_ecommerce_reviews.csv"}, {"validation": "s3://dsoaws/data/validation/part-algo-1-womens_clothing_ecommerce_reviews.csv"}
# #datasets = load_dataset("womens-clothing", data_files=data_files)

# dataset = load_from_disk("s3://dsoaws/parquet/")  #, storage_options=storage_options) 


In [None]:
# builder = load_dataset_builder("parquet", data_files=data_files)
# builder.download_and_prepare("s3://dsoaws/bloom/data/")


In [3]:
# import torchdata
# from torchdata.datapipes.iter import S3FileLister, S3FileLoader, IterableWrapper
 
# # S3_BUCKET = "s3://dsoaws/data/train/part-algo-1-womens_clothing_ecommerce_reviews.csv"

# #S3_BUCKET = "s3://dsoaws/parquet/"
    
# # #train_files = IterableWrapper([S3_BUCKET]).list_files_by_s3()
# # train_files = S3FileLister([S3_BUCKET])
# # train_files

# # print(next(iter(train_files)))

In [4]:
# from torchdata.datapipes.iter import S3FileLister, S3FileLoader, IterableWrapper

# s3_file_loader = S3FileLoader(train_files)
# s3_file_loader

S3FileLoaderIterDataPipe

In [None]:
# df = s3_file_loader.load_parquet_as_df(columns=['review_body'])
# df

In [2]:
from pathlib import Path
from typing import Any, Union
from urllib.parse import ParseResult, urlparse


class UrlPath:
    """
    Represents a path on "some" storage specified by the ``scheme`` parameter.
    URLs with no ``scheme`` are assumed to be on a local filesystem.

    Example:

    .. doctest::

     >>> from mfive.util.urlpath import UrlPath
     >>> str(UrlPath("s3://foo/bar/baz"))
     's3://foo/bar/baz'

     # absolute path /tmp/foo/bar
     >>> UrlPath("/tmp/foo/bar") == UrlPath("file:///tmp/foo/bar")
     True

     # relative path $CWD/tmp/foo/bar
     >>> UrlPath("tmp/foo/bar") == UrlPath("file://tmp/foo/bar")
     True


    You can join paths using ``/`` (div) as:

    .. doctest::

     >>> from mfive.checkpoint.api import UrlPath
     >>> UrlPath("s3://foo/bar") / "baz"
     UrlPath(url="s3://foo/bar/baz")

     >>> str(UrlPath("file:///tmp/foo/bar") / "baz")
     '/tmp/foo/bar/baz'

     >>> UrlPath("/tmp/foo/bar") / "baz"
     UrlPath(url="/tmp/foo/bar/baz")

     >>> UrlPath("s3://foo/bar") / UrlPath("s3://baz")
     Traceback (most recent call last):
        ...
     ValueError: Cannot join `s3://foo/bar` with `s3://baz`

     >>> UrlPath("s3://foo/bar") / UrlPath("baz")
     Traceback (most recent call last):
        ...
     ValueError: Cannot join `s3://foo/bar` with `baz`

    """

    SCHEMES = {"file", "s3", "artifactkit", "datakit"}

    def __init__(self, url: str):
        parse_result: ParseResult = urlparse(url)
        self.scheme: str = parse_result.scheme
        if not self.scheme:
            self.scheme = "file"

        # pass it through pathlib.Path to clean the path of any trailing "/"
        p = Path(f"{parse_result.netloc}{parse_result.path}")
        self.path = str(p)
        self.filename = p.name
        if self.scheme not in self.SCHEMES:
            raise ValueError(
                f"`{self.scheme}://` is not supported. Valid schemes: {', '.join(self.SCHEMES)}"
            )

    def __str__(self) -> str:
        if self.scheme == "file":
            return self.path
        else:
            return f"{self.scheme}://{self.path}"

    def __repr__(self) -> str:
        return f'{UrlPath.__name__}(url="{self.__str__()}")'

    def __eq__(self, other: Any) -> bool:
        if isinstance(other, UrlPath):
            return self.scheme == other.scheme and self.path == other.path
        else:
            return False

    def __truediv__(self, other: Union[str, "UrlPath"]) -> "UrlPath":
        if isinstance(other, str):
            return UrlPath(f"{self.scheme}://{str(Path(self.path) / other)}")
        else:  # isinstance(other, UrlPath):
            if self.scheme == other.scheme and self.scheme == "file":
                return UrlPath(f"{self.scheme}://{str(Path(self.path) / other.path)}")
            raise ValueError(f"Cannot join `{self}` with `{other}`")

    @staticmethod
    def is_url(url_str: str) -> bool:
        """
        Checks if the given string is a URL. Note that this method does not check whether the string
        is a "valid" URL, it simply checks whether the string has a "scheme". If the URL string is
        malformed, this method raises an exception.

        Returns: ``True`` if the string has a URL scheme, ``False`` otherwise.

        """

        parse_result = urlparse(url_str)
        if parse_result.scheme:
            return True
        else:
            return False
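
The scheme handling in `UrlPath.__init__` relies on how `urllib.parse.urlparse` splits a string into scheme, netloc, and path. The sketch below isolates that step with a hypothetical helper, `split_url`, which skips the `pathlib.Path` normalization that `UrlPath` applies. It shows why a relative path and its `file://` form compare equal: the first path component lands in `netloc` and is joined back onto the path.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_url(url: str) -> Tuple[str, str]:
    """Return (scheme, netloc + path), defaulting the scheme to "file"."""
    parts = urlparse(url)
    scheme = parts.scheme or "file"
    return scheme, f"{parts.netloc}{parts.path}"


for url in [
    "s3://foo/bar/baz",
    "file:///tmp/foo/bar",
    "/tmp/foo/bar",
    "file://tmp/foo/bar",
    "tmp/foo/bar",
]:
    print(url, "->", split_url(url))
```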

In [3]:
"""
S3 Reader datapipe.
"""
import logging
from enum import Enum
from typing import Iterator, Tuple

import boto3

from torch.utils.data import functional_datapipe
from torch.utils.data.datapipes.utils.common import StreamWrapper
from torchdata.datapipes.iter import FSSpecFileOpener, IterableWrapper, IterDataPipe
from torchdata.datapipes.iter.load.s3io import S3FileLoaderIterDataPipe


class S3Pipe(str, Enum):
    """
    Chooses which implementation of s3 reader to use.

    1. ``s3io``: `s3-plugin <https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/load/s3io.py>`_ (pybinded c++ client).
    1. ``fsspec``:  aioboto3 via `fsspec <https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/load/fsspec.py>`_

    .. note:: Use ``s3io`` for partitions larger than 2GB or when it is desirable to use fewer dataloader workers.
              Use ``fsspec`` for smaller partitions and when one can use more dataloader workers to attain higher throughput.

    """

    s3io = "s3io"
    fsspec = "fsspec"



MB: int = 1024 * 1024  # 1MB == 1024 * KB == 1024 * 1024 bytes

log: logging.Logger = logging.getLogger(__name__)


@functional_datapipe("read_files_from_s3_part_3")
class S3FileReader(IterDataPipe[str]):  # type: ignore[misc] # see mypy.ini for details
    """
    A convenience wrapper around the two types of S3 readers that are bundled with torchdata. Namely:

    1. `S3FileLoaderIterDataPipe <https://pytorch.org/data/beta/generated/torchdata.datapipes.iter.S3FileLoader.html#torchdata.datapipes.iter.S3FileLoader>`_
    1. `FSSpecFileOpener <https://pytorch.org/data/beta/generated/torchdata.datapipes.iter.FSSpecFileOpener.html#fsspecfileopener>`_

    One can choose which implementation to use to read files from s3.
    This iter-datapipe yields a tuple of ``(url, BytesIO)`` for each iterated ``url`` from the source datapipe.
    """

    def __init__(
        self,
        source: IterDataPipe[str],
        s3pipe: S3Pipe,
        buffer_size: int = 128 * MB,
    ) -> None:
        """

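        Arguments:
            source: an iter-datapipe that yields s3 urls (e.g. ``s3://<bucket>/<key>``)
            s3pipe: ``s3io`` for ``S3FileLoaderIterDataPipe`` and ``fsspec`` for ``FSSpecFileOpener``.
            buffer_size: size in bytes passed to the selected reader as its buffer or block size (default: 128 MB).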
        """
        self.source = source
        self.s3loader_type = s3pipe
        self.buffer_size = buffer_size

    def __iter__(self) -> Iterator[Tuple[str, StreamWrapper]]:
        for url in self.source:
            wrapped_url = IterableWrapper([url])
            if self.s3loader_type == S3Pipe.s3io:
                bucket = UrlPath(url).path.split("/")[0]
                resp = (
                    boto3.session.Session()
                    .client("s3")
                    .get_bucket_location(Bucket=bucket)
                )
                # null location constraint == us-east-1
                # see: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_bucket_location
                region = resp["LocationConstraint"] or "us-east-1"
                s3loader = S3FileLoaderIterDataPipe(
                    wrapped_url,
                    region=region,
                    multi_part_download=False,
                    buffer_size=self.buffer_size,
                )
            else:  # self.s3loader_type == S3Pipe.fsspec
                s3loader = FSSpecFileOpener(
                    wrapped_url,
                    # DO NOT set cache_regions to True
                    # Otherwise it makes this datapipe non-threadsafe
                    # (e.g. not compatible with dataloader.num_workers>0)
#                    cache_regions=False,
                    default_cache_type="readahead",
                    default_block_size=self.buffer_size,
                )

            yield next(iter(s3loader))

    def __len__(self) -> int:
        return len(self.source)

In [4]:
from typing import Callable, Dict, Iterable, Iterator, List
from torchdata.datapipes.iter import IterDataPipe
from torchdata.datapipes.iter.load.fsspec import FSSpecFileListerIterDataPipe

DataPipeCtor = Callable[[str], IterDataPipe[str]]

_DEFAULT = FSSpecFileListerIterDataPipe

class ListDirs(IterDataPipe[str]):  # type: ignore[misc] # see mypy.ini for details
    """
    Lists filenames in the specified directory. The directory can be specified using a URI.

    Usage Example:

    .. doctest::

     >>> list(ListDirs(dirs=["file:///tmp"])) == list(ListDirs(dirs=["/tmp"]))
     True


    .. note:: The additional keyword arguments are passed directly to the listers for each scheme.

    Supported URIs:

    #. ``file:///<local_dir_path>`` or ``<local_dir_path>`` (URLs missing the scheme are interpreted as a local dir)
    #. ``s3://<bucket>/<prefix>``
    #. ``datakit://<dataset_name>/<dataset_branch>`` (for datasets in M5 datahub - ``s3://m5-datasets-dev-us-east-1``)
    #. ``datakit://<dataset_name>/<dataset_branch>/<split_name>`` (e.g. ``datakit://foo/bar/train``)

    Examples:

    #. ``file:///home/kiuk/my_dataset_dir`` or equivalently ``/home/kiuk/my_dataset_dir``
    #. ``datakit://foo/bar/all``
    #. ``datakit://foo/bar/train``
    #. ``datakit://foo/bar/validation``
    #. ``datakit://foo/bar/test``


    .. important:: Users outside of the M5 team, to use a custom s3 bucket (and not the one for m5ds hub)
                   with ``datakit://`` set the ``s3_bucket=<custom bucket name>`` as a keyword argument.
                   Example: ``ListDirs(["datakit://foo/bar"], s3_bucket="my-teams-dataset-bucketname")``
    """

    def __init__(self, dirs: Iterable[str], **kwargs: str) -> None:
        self.dir_urls: List[UrlPath] = [UrlPath(url) for url in dirs]

        self.files = []

        for url in self.dir_urls:
            # list dirs up front so that we can shard by file
            self.files.extend(FSSpecFileListerIterDataPipe(str(url), **kwargs))

        # sort by filepath for stable sharding
        # since list APIs for filesystems are not always deterministic
        self.files.sort()

        self.instance_id = 0
        self.num_instances = 1

    def is_shardable(self) -> bool:
        return True


    def apply_sharding(self, num_instances: int, instance_id: int) -> None:
        assert instance_id < num_instances
        self.num_instances = num_instances
        self.instance_id = instance_id

        num_total_shards = len(self.files)
        assert num_total_shards >= num_instances, (
            f"total # files={num_total_shards} in {self.dir_urls}"
            f" is less than the total # dataloader worker instances={num_instances}"
            f" either decrease the number of trainers and/or num dataloader workers"
            f" or use a larger dataset"
        )


    def __iter__(self) -> Iterator[str]:
        for idx, fileurl in enumerate(self.files):
            if idx % self.num_instances == self.instance_id:
                yield fileurl
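
The sharding rule used by `ListDirs.__iter__` can be sketched in isolation as follows. The helper `shard` and the file names are hypothetical; the logic mirrors the class above: sort the file list so every worker sees the same order, then let each instance keep the files whose index modulo `num_instances` equals its `instance_id`.

```python
from typing import List


def shard(files: List[str], num_instances: int, instance_id: int) -> List[str]:
    """Return the files assigned to one worker under round-robin sharding."""
    if not 0 <= instance_id < num_instances:
        raise ValueError("instance_id must be in [0, num_instances)")
    # Sort first: listing APIs do not guarantee a stable order across workers.
    ordered = sorted(files)
    return [
        fileurl
        for idx, fileurl in enumerate(ordered)
        if idx % num_instances == instance_id
    ]


files = ["part-2.parquet", "part-0.parquet", "part-3.parquet", "part-1.parquet"]
shards = [shard(files, num_instances=2, instance_id=i) for i in range(2)]
print(shards)
```

Because the assignment depends only on the sorted position, the shards are disjoint and together cover every file.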


In [3]:
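import pandas as pd
import pyarrow.parquet as pq

# `import pyarrow` alone does not load the `parquet` submodule, so import it explicitly.
# `read_pandas` returns a pyarrow Table; convert it to a pandas DataFrame.
df = pq.read_pandas('./data-parquet/Digital_Software').to_pandas()
df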

In [49]:
from torchdata.datapipes.iter import FileLister
import torcharrow.dtypes as dt

# marketplace: string
# customer_id: string
# review_id: string
# product_id: string
# product_parent: string
# product_title: string
# star_rating: int32
# helpful_votes: int32
# total_votes: int32
# vine: string
# verified_purchase: string
# review_headline: string
# review_body: string
# review_date: date32[day]
# year: int32

schema = dt.Struct([
    # dt.Field("marketplace", dt.string),
    # dt.Field("customer_id", dt.string),
    # dt.Field("review_id", dt.string),
    # dt.Field("product_id", dt.string),
    # dt.Field("product_parent", dt.string),
    # dt.Field("product_title", dt.string),
    # dt.Field("star_rating", dt.int32),
    # dt.Field("helpful_votes", dt.int32),     
    # dt.Field("total_votes", dt.int32),  
    # dt.Field("vine", dt.string),
    # dt.Field("verified_purchase", dt.string),
    # dt.Field("review_headline", dt.string),
    dt.Field("review_body", dt.string),
    # dt.Field("review_date", dt.int32),
    # dt.Field("year", dt.int32) 
])

source_dp = FileLister("./data-parquet/Digital_Software", masks="part*.parquet")
parquet_df_dp = source_dp.load_parquet_as_df(dtype=schema, columns=["review_body"])
list(parquet_df_dp)



KeyboardInterrupt: 

In [11]:
# from datasets import load_dataset
# dataset = load_dataset("parquet", data_files={'train': 's3://dsoaws/parquet'})

InvalidSchema: No connection adapters were found for 's3://dsoaws/parquet'

In [50]:
import torch

S3_BUCKET = "s3://dsoaws/data/train/part-algo-1-womens_clothing_ecommerce_reviews.csv"

#S3_BUCKET = "s3://dsoaws/parquet/"

#def get_dataset(cfg: Config, split: DataSplit) -> IterDataPipe[InputExample]:

from transformers import AutoTokenizer
    
model_checkpoint = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


text_column_name = "review_body"

def tokenize_function(examples):
    tokenized = tokenizer(examples[text_column_name])
    return tokenized



from typing import Dict, List, Tuple

InputExample = Tuple[Dict[str, torch.LongTensor], int]

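def process_example(data: Tuple[str, ...]) -> Dict[str, torch.LongTensor]:
    # `data` is one parsed row; index 4 is the column passed to the tokenizer.
    text = data[4]

    input_ids = tokenizer.encode(
        text,
        max_length=4096,
        truncation=True,
        padding="max_length",
    )

    # Named `features` to avoid shadowing the built-in `input`.
    features = {"input_ids": torch.LongTensor(input_ids)}

    return features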


# examples = ListDirs([S3_BUCKET]) \
#               .read_files_from_s3_part_3(S3Pipe.fsspec) \
#               .parse_csv(skip_lines=1, as_tuple=True) \
#               .map(process_example)

examples = ListDirs([S3_BUCKET]) \
              .read_files_from_s3_part_3(S3Pipe.fsspec) \
              .load_parquet_as_df(dtype=schema, columns=["review_body"])

# .readbytes(return_path=False) \
#              .map(process_example)
              # .parse_csv(skip_lines=1, as_tuple=True) \
              # .map(process_example)

              # .readlines(return_path=False) \
              # .map(process_example)




In [51]:
print(next(iter(examples)))

TypeError: Cannot convert tuple to pyarrow.lib.NativeFile
This exception is thrown by __iter__ of ParquetDFLoaderIterDataPipe()

In [None]:
tokenized_dataset_train = dataset_train.map(tokenize_function, batched=True, num_proc=4, remove_columns=[
    'marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 
    'product_category', 'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 
    'review_headline', 'review_date', text_column_name])

In [None]:
tokenized_dataset_validation = dataset_validation.map(tokenize_function, batched=True, num_proc=4, remove_columns=[
    'marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 
    'product_category', 'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase', 
    'review_headline', 'review_date', text_column_name])

# Convert Raw Text to BERT Features using Hugging Face and TensorFlow

In [None]:
import tensorflow as tf
import collections
import json
import os
import pandas as pd
import csv
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

REVIEW_BODY_COLUMN = "review_body"
REVIEW_ID_COLUMN = "review_id"

LABEL_COLUMN = "star_rating"
LABEL_VALUES = [1, 2, 3, 4, 5]

label_map = {}
for (i, label) in enumerate(LABEL_VALUES):
    label_map[label] = i


class InputFeatures(object):
    """BERT feature vectors."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id, review_id, date, label):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
        self.review_id = review_id
        self.date = date
        self.label = label


class Input(object):
    """A single training/test input for sequence classification."""

    def __init__(self, text, review_id, date, label=None):
        """Constructs an Input.
        Args:
          text: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
          review_id: string. The identifier of the review the text belongs to.
          date: string. The event time attached to the record.
          label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.text = text
        self.review_id = review_id
        self.date = date
        self.label = label


def convert_input(the_input, max_seq_length):
    # First, we need to preprocess our data so that it matches the data BERT was trained on:
    # 1. Lowercase our text (if we're using a BERT lowercase model)
    # 2. Tokenize it (i.e. "sally says hi" -> ["sally", "says", "hi"])
    # 3. Break words into WordPieces (i.e. "calling" -> ["call", "##ing"])
    #
    # Fortunately, the Transformers tokenizer does this for us!

    tokens = tokenizer.tokenize(the_input.text)
    tokens.insert(0, '[CLS]')
    tokens.append('[SEP]')
    print("**{} tokens**\n{}\n".format(len(tokens), tokens))

    encode_plus_tokens = tokenizer.encode_plus(
        the_input.text,
        padding='max_length', 
        max_length=max_seq_length,
        truncation=True
    )
    
    # The id from the pre-trained BERT vocabulary that represents the token.  (Padding of 0 will be used if the # of tokens is less than `max_seq_length`)
    input_ids = encode_plus_tokens["input_ids"]

    # Specifies which tokens BERT should pay attention to (0 or 1).  Padded `input_ids` will have 0 in each of these vector elements.
    input_mask = encode_plus_tokens["attention_mask"]

    # Segment ids are always 0 for single-sequence tasks such as text classification.  1 is used for two-sequence tasks such as question/answer and next sentence prediction.
    segment_ids = [0] * max_seq_length

    # Label for each training row (`star_rating` 1 through 5)
    label_id = label_map[the_input.label]

    features = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_id=label_id,
        review_id=the_input.review_id,
        date=the_input.date,
        label=the_input.label,
    )

    print("**{} input_ids**\n{}\n".format(len(features.input_ids), features.input_ids))
    print("**{} input_mask**\n{}\n".format(len(features.input_mask), features.input_mask))
    print("**{} segment_ids**\n{}\n".format(len(features.segment_ids), features.segment_ids))
    print("**label_id**\n{}\n".format(features.label_id))
    print("**review_id**\n{}\n".format(features.review_id))
    print("**date**\n{}\n".format(features.date))
    print("**label**\n{}\n".format(features.label))

    return features


# We'll need to transform our data into a format that BERT understands.
# - `text` is the text we want to classify, which in this case is the `review_body` field in our DataFrame.
# - `label` is the star_rating label (1, 2, 3, 4, 5) for our training input data
def transform_inputs_to_tfrecord(inputs, output_file, max_seq_length):
    records = []
    tf_record_writer = tf.io.TFRecordWriter(output_file)

    for (input_idx, the_input) in enumerate(inputs):
        if input_idx % 10000 == 0:
            print("Writing input {} of {}\n".format(input_idx, len(inputs)))

        features = convert_input(the_input, max_seq_length)

        all_features = collections.OrderedDict()

        # Create TFRecord With input_ids, input_mask, segment_ids, and label_ids
        all_features["input_ids"] = tf.train.Feature(int64_list=tf.train.Int64List(value=features.input_ids))
        all_features["input_mask"] = tf.train.Feature(int64_list=tf.train.Int64List(value=features.input_mask))
        all_features["segment_ids"] = tf.train.Feature(int64_list=tf.train.Int64List(value=features.segment_ids))
        all_features["label_ids"] = tf.train.Feature(int64_list=tf.train.Int64List(value=[features.label_id]))

        tf_record = tf.train.Example(features=tf.train.Features(feature=all_features))
        tf_record_writer.write(tf_record.SerializeToString())

        # Create Record For Feature Store With All Features
        records.append(
            {
                "input_ids": features.input_ids,
                "input_mask": features.input_mask,
                "segment_ids": features.segment_ids,
                "label_id": features.label_id,
                "review_id": the_input.review_id,
                "date": the_input.date,
                "label": features.label,
            }
        )

    tf_record_writer.close()

    return records

Three (3) feature vectors are created from each raw review (`review_body`) during the feature engineering phase to prepare for BERT processing:

* **`input_ids`**:  The id from the pre-trained BERT vocabulary that represents the token.  (Padding of 0 will be used if the # of tokens is less than `max_seq_length`)
    
* **`input_mask`**:  Specifies which tokens BERT should pay attention to (0 or 1).  Padded `input_ids` will have 0 in each of these vector elements.

* **`segment_ids`**:  Segment ids are always 0 for single-sequence tasks such as text classification.  1 is used for two-sequence tasks such as question/answer and next sentence prediction.

One (1) label is created from each raw review (`star_rating`):

* **`label_id`**:  Label for each training row (`star_rating` 1 through 5)

# Demonstrate the BERT-specific Feature Engineering Step
While we are demonstrating this code with a small amount of data here in the notebook, we will soon scale this to much more data on a powerful SageMaker cluster.

## Feature Store requires an Event Time feature

We need a record identifier name and an event time feature name. This will match the column of the corresponding features in our data. 

Note: The event time feature must be of type Fractional (Unix timestamp in seconds) or String (ISO-8601 format). An event time feature of type Integral is rejected.
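
The two accepted representations can be produced from the same moment in time, as in the minimal sketch below. The helper `event_time_formats` is hypothetical and uses only the standard library; it insists on a timezone-aware `datetime` so that the trailing `Z` in the string really means UTC.

```python
from datetime import datetime, timezone
from typing import Tuple


def event_time_formats(moment: datetime) -> Tuple[str, float]:
    """Return (ISO-8601 string, Unix timestamp in seconds) for one moment."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    moment_utc = moment.astimezone(timezone.utc)
    # String form: ISO-8601 with a "Z" suffix marking UTC.
    iso_string = moment_utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Fractional form: seconds since the Unix epoch as a float.
    unix_seconds = moment_utc.timestamp()
    return iso_string, unix_seconds


iso_string, unix_seconds = event_time_formats(
    datetime(2023, 2, 14, 12, 30, 0, tzinfo=timezone.utc)
)
print(iso_string, unix_seconds)
```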

In [None]:
from datetime import datetime, timezone

# Use UTC so that the trailing "Z" (the UTC designator) in the ISO-8601 string is accurate.
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(timestamp)

In [None]:
import pandas as pd

data = [
    [
        5,
        "ABCD12345",
        """I needed an "antivirus" application and know the quality of Norton products.  This was a no brainer for me and I am glad it was so simple to get.""",
    ],
    [
        3,
        "EFGH12345",
        """The problem with ElephantDrive is that it requires the use of Java. Since Java is notorious for security problems I haveit removed from all of my computers. What files I do have stored are photos.""",
    ],
    [
        1,
        "IJKL2345",
        """Terrible, none of my codes worked, and I can't uninstall it.  I think this product IS malware and viruses""",
    ],
]

df = pd.DataFrame(data, columns=["star_rating", "review_id", "review_body"])

# Create an `Input` instance (defined above) for each row of the DataFrame
inputs = df.apply(
    lambda x: Input(label=x[LABEL_COLUMN], text=x[REVIEW_BODY_COLUMN], review_id=x[REVIEW_ID_COLUMN], date=timestamp),
    axis=1,
)

In [None]:
# Make sure the date is in the correct ISO-8601 format for Feature Store
print(inputs[0].date)

## Save TFRecords

The three (3) feature vectors and one (1) label are converted into a list of `TFRecord` instances (one per row of training data):
* **`tf_records`**:  Binary representation of each row of training data (3 features + 1 label)

These `TFRecord`s are the engineered features that we will use throughout the rest of the pipeline.

In [None]:
output_file = "./data-tfrecord-featurestore/data.tfrecord"

# Add Features to SageMaker Feature Store

## Create FeatureGroup

A feature group is a logical grouping of features, defined in the Feature Store, to describe records. A feature group definition is composed of a list of feature definitions, a record identifier name, and configurations for its online and offline store.

Use the `CreateFeatureGroup`, `DescribeFeatureGroup`, `UpdateFeatureGroup`, `DeleteFeatureGroup`, and `ListFeatureGroups` APIs to manage feature groups.
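
To make the parts of a feature group definition concrete, here is a plain-Python sketch. `FeatureGroupSpec` is a hypothetical dataclass written for illustration, not a class from the SageMaker SDK; it only captures the pieces named above and checks that the record identifier and event time names refer to defined features.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FeatureGroupSpec:
    """Hypothetical mirror of a feature group definition (not an SDK class)."""

    name: str
    # Feature name -> feature type ("String", "Integral", or "Fractional").
    feature_types: Dict[str, str]
    record_identifier_name: str
    event_time_feature_name: str
    offline_store_s3_uri: Optional[str] = None
    enable_online_store: bool = False

    def validate(self) -> None:
        # Both special features must appear among the feature definitions.
        for required in (self.record_identifier_name, self.event_time_feature_name):
            if required not in self.feature_types:
                raise ValueError(f"`{required}` is not a defined feature")
        # Event time must be a String (ISO-8601) or Fractional (Unix seconds).
        event_type = self.feature_types[self.event_time_feature_name]
        if event_type not in ("String", "Fractional"):
            raise ValueError(f"event time type `{event_type}` is not supported")


spec = FeatureGroupSpec(
    name="reviews-feature-group-example",  # illustrative name
    feature_types={"review_id": "String", "date": "String", "label_id": "Integral"},
    record_identifier_name="review_id",
    event_time_feature_name="date",
)
spec.validate()
print(spec)
```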


In [None]:
from time import gmtime, strftime, sleep

feature_group_name = "reviews-feature-group-" + strftime("%d-%H-%M-%S", gmtime())
print(feature_group_name)

In [None]:
from sagemaker.feature_store.feature_definition import (
    FeatureDefinition,
    FeatureTypeEnum,
)

feature_definitions = [
    FeatureDefinition(feature_name="input_ids", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="input_mask", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="segment_ids", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="label_id", feature_type=FeatureTypeEnum.INTEGRAL),
    FeatureDefinition(feature_name="review_id", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="date", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="label", feature_type=FeatureTypeEnum.INTEGRAL),
    FeatureDefinition(feature_name="split_type", feature_type=FeatureTypeEnum.STRING),
]

In [None]:
from sagemaker.feature_store.feature_group import FeatureGroup

feature_group = FeatureGroup(name=feature_group_name, feature_definitions=feature_definitions, sagemaker_session=sess)
print(feature_group)

## Specify `record identifier` and `event time` features

In [None]:
record_identifier_feature_name = "review_id"
event_time_feature_name = "date"

## Set S3 Prefix for Offline Feature Store

In [None]:
prefix = "reviews-feature-store-" + timestamp
print(prefix)

## Create Feature Group

The last step in creating the feature group is to call the `create` function. The online store is not created by default; set `enable_online_store=True` to enable it. Here it stays disabled, so only the offline store is created. The `s3_uri` is the location of the offline store.

In [None]:
feature_group.create(
    s3_uri=f"s3://{bucket}/{prefix}",
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=False,
)

## Describe the Feature Group

In [None]:
feature_group.describe()

## Review The Records To Ingest Into Feature Store

In [None]:
records = transform_inputs_to_tfrecord(inputs, output_file, max_seq_length)

# _IGNORE THE WARNING ^^ ABOVE ^^_

## Wait for Feature Group Creation to Complete

## _Note:  This may take a few minutes.  Please be patient._

Creating a feature group takes some time. We need to wait until creation completes before we can use the feature group. The following function polls its status.

In [None]:
import time


def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")

In [None]:
wait_for_feature_group_creation_complete(feature_group=feature_group)
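
The polling loop above waits for as long as the status stays `Creating`. A variant with a timeout is sketched below as one possible refinement. The helper `wait_for_status` and its parameters are assumptions for illustration; it takes any callable that returns a dict containing `FeatureGroupStatus`, so it could be called as `wait_for_status(feature_group.describe)`. A fake `describe` function is used here so the sketch runs without AWS access.

```python
import time
from typing import Callable, Dict


def wait_for_status(
    describe: Callable[[], Dict[str, str]],
    pending: str = "Creating",
    done: str = "Created",
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll `describe` until the status leaves `pending` or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    status = describe().get("FeatureGroupStatus")
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still `{pending}` after {timeout_seconds}s")
        time.sleep(poll_seconds)
        status = describe().get("FeatureGroupStatus")
    if status != done:
        raise RuntimeError(f"unexpected status `{status}`")
    return status


# Fake describe() for demonstration: two pending responses, then success.
_responses = iter(["Creating", "Creating", "Created"])


def fake_describe() -> Dict[str, str]:
    return {"FeatureGroupStatus": next(_responses)}


print(wait_for_status(fake_describe, poll_seconds=0))
```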

# Ingest Records into Feature Store

After the FeatureGroups have been created, we can put data into the FeatureGroups by using the `PutRecord` API. 

This API can handle high TPS and is designed to be called by different streams. The data from all of these Put requests is buffered and written to S3 in chunks. 

The files will be written to the offline store within a few minutes of ingestion. To accelerate the ingestion process, we can specify multiple workers to do the job simultaneously. 

Use `put_record(...)` to put a single record in the FeatureGroup.

Use `ingest(...)` to ingest the content of a pandas DataFrame into the Feature Store. Set `max_workers` to the number of threads that work on different partitions of the `data_frame` in parallel.
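
The idea of splitting a batch into partitions and writing each one from its own thread can be sketched as follows. This is an illustration of the general pattern only, not the SDK's implementation: `partition`, `ingest_in_parallel`, and the `put_record` callable are hypothetical, and the records are appended to an in-memory list instead of being sent to the Feature Store.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def partition(records: List[Any], num_partitions: int) -> List[List[Any]]:
    """Split records into at most `num_partitions` contiguous chunks."""
    if not records:
        return []
    size = -(-len(records) // num_partitions)  # ceiling division
    return [records[i : i + size] for i in range(0, len(records), size)]


def ingest_in_parallel(
    records: List[Any], put_record: Callable[[Any], None], max_workers: int = 3
) -> int:
    """Call `put_record` for every record, one thread per partition."""

    def ingest_partition(chunk: List[Any]) -> int:
        for record in chunk:
            put_record(record)
        return len(chunk)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(ingest_partition, partition(records, max_workers)))


# Stand-in for the Feature Store: collect records in a list.
stored: List[Any] = []
rows = [{"review_id": str(i)} for i in range(7)]
count = ingest_in_parallel(rows, stored.append, max_workers=3)
print(count, len(stored))
```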

In [None]:
import pandas as pd

df_records = pd.DataFrame.from_dict(records)
df_records["split_type"] = "train"
df_records

# Cast DataFrame `Object` to Supported Feature Store Data Type `String`

In [None]:
def cast_object_to_string(data_frame):
    for label in data_frame.columns:
        if data_frame.dtypes[label] == "object":
            data_frame[label] = data_frame[label].astype("str").astype("string")
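
Casting with `astype("str")` turns list-valued columns such as `input_ids` into their string form, for example `"[101, 2023, 0]"`. Anything that later reads these features back has to parse that string into a list again. A minimal sketch of the round trip, using hypothetical helpers `encode_vector` and `decode_vector`, is shown below; it relies on the fact that the string form of a list of integers is also valid JSON.

```python
import json
from typing import List


def encode_vector(vector: List[int]) -> str:
    """String form of a list of ints, as produced by casting to str."""
    return str(vector)


def decode_vector(text: str) -> List[int]:
    """Parse the string form back into a list of ints."""
    return json.loads(text)


encoded = encode_vector([101, 2023, 102, 0])
decoded = decode_vector(encoded)
print(repr(encoded), decoded)
```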

In [None]:
%%time

cast_object_to_string(df_records)

feature_group.ingest(data_frame=df_records, max_workers=3, wait=True)

# Wait For Feature Store To Become Active
## _Note:  This may take a few minutes.  Please be patient._

In [None]:
feature_store_describe_response = feature_group.describe()

while "OfflineStoreStatus" not in feature_store_describe_response.keys():
    feature_store_describe_response = feature_group.describe()
    print("[INFO] Waiting for OfflineStore to be created.")
    # print(json.dumps(feature_store_describe_response, indent=4, sort_keys=True, default=str))
    sleep(120)

print("Offline store created.")

In [None]:
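offline_store_status = None

while offline_store_status != 'Active':
    try:
        offline_store_status = feature_group.describe()['OfflineStoreStatus']['Status']
    except KeyError:
        # The status key is absent until the offline store has been created.
        pass
    if offline_store_status != 'Active':
        # Pause between polls instead of calling the API in a tight loop.
        sleep(30)
print('Offline store status: {}'.format(offline_store_status))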

# Query the Feature Store

In [None]:
feature_store_query = feature_group.athena_query()

feature_store_table = feature_store_query.table_name

query_string = """
    SELECT 
        input_ids,
        input_mask,
        segment_ids, 
        label_id,
        review_id,
        date,
        label,
        split_type
    FROM "{}" 
    WHERE split_type='train' 
    LIMIT 3
""".format(feature_store_table)

print('Glue Catalog table name: {}'.format(feature_store_table))
print('Running query: {}'.format(query_string))

In [None]:
output_s3_uri = 's3://{}/query_results/{}/'.format(bucket, prefix)
print(output_s3_uri)

In [None]:
feature_store_query.run(
    query_string=query_string, 
    output_location=output_s3_uri
)

feature_store_query.wait()

In [None]:
import pandas as pd
pd.set_option("max_colwidth", 100)

df_feature_store = feature_store_query.as_dataframe()
df_feature_store

# Review the Feature Store

![Feature Store](img/feature_store_sm_extension.png)

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>