In [1]:
import os
os.chdir('..')

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Install libraries to create user and item features
!pip install -q selective;
!pip install -q textwiser;
!pip install -q seq2pat;

# Feature Engineering

* The goal of this notebook is to show examples of how to create item and user features. 
* Mab2Rec is _independent_ of item, user, and interaction data used in recommendations and assumes that input data is created before building recommenders. 
* Sample input is given in `data/` which includes user features in `features_user.csv` and item features in `features_item.csv`.
* This notebook shows examples of how to create user or item features from **structured**, **unstructured**, and **sequential** data.
* In addition to the techniques covered here, you can use any other source to create your input data.
* An overview of these libraries is [presented at All Things Open 2021](https://www.youtube.com/watch?v=54d_YUalvOA).

# Table of Contents

1. [Structured Data via Selective](#Structured-Data-via-Selective)
2. [Unstructured Data via TextWiser](#Unstructured-Data-via-TextWiser)
3. [Sequential Data via Seq2Pat](#Sequential-Data-via-Seq2Pat)   

# Structured Data via Selective

* The most common data source is structured, tabular data. 
* In recommenders, the typical usage of structured data is to represent **user features**.
* When there are many user features to consider, feature selection can decide which features to include in the user context.
* For feature selection, you can leverage [Selective](https://github.com/fidelity/selective).
* Selective provides an easy-to-use API for supervised and unsupervised feature selection methods.
* In unsupervised fashion, given a set of users, important features can be identified according to variance, correlation and statistical measures. 
* In supervised fashion, given a set of users _and_ the interaction label (e.g., click on _any item_), important features can be identified according to a linear or non-linear model. 
* Let's explore a quick start example. 

In [3]:
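# Import Selective and SelectionMethod
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older scikit-learn version
from sklearn.datasets import load_boston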
from feature.utils import get_data_label
from feature.selector import Selective, SelectionMethod

# Data
data, label = get_data_label(load_boston())

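# Feature selectors from simple to more complex
# Each assignment overwrites the previous one, so only the last selector (TreeBased) is applied below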
selector = Selective(SelectionMethod.Variance(threshold=0.0))
selector = Selective(SelectionMethod.Correlation(threshold=0.5, method="pearson"))
selector = Selective(SelectionMethod.Statistical(num_features=3, method="anova"))
selector = Selective(SelectionMethod.Linear(num_features=3, regularization="none"))
selector = Selective(SelectionMethod.TreeBased(num_features=3))

# Feature reduction
subset = selector.fit_transform(data, label)
print("Reduction:", list(subset.columns))
print("Scores:", list(selector.get_absolute_scores()))

Reduction: ['RM', 'DIS', 'LSTAT']
Scores: [0.03785112185484013, 0.0009506038523461075, 0.005597120928343709, 0.0006564025010623736, 0.02400336475058042, 0.4385965510576283, 0.01419565112974908, 0.06486214843307006, 0.00430519230815954, 0.014455710682486108, 0.016132316878544155, 0.0107710325703614, 0.3676227830528288]


* In this example, we show 5 different `selector` methods. 
* Any selection approach can be used to `fit_transform` the dataset. 
* A more robust approach is to apply different selectors, and then to select features that are deemed important by several selectors.
* It is even better to repeat this within cross-validation to make sure the selection is stable. 
* Selective offers a benchmarking utility to achieve this. 
* See [Selective Benchmarking](https://github.com/fidelity/selective#benchmarking).
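
The idea of keeping features that several selectors agree on can be sketched in plain Python. This is a minimal illustration and not Selective's benchmarking utility: the helper `vote_features` is hypothetical, and the per-selector feature lists are invented toy inputs that stand in for the columns each selector would return from `fit_transform`.

```python
from collections import Counter


def vote_features(selections, min_votes):
    """Keep features chosen by at least `min_votes` selectors.

    selections: dict mapping a selector name to the list of feature
    names that selector kept (hypothetical inputs for illustration).
    """
    votes = Counter()
    for features in selections.values():
        votes.update(set(features))  # each selector votes once per feature
    # Sort by vote count (descending), then by name for a stable order
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, count in ranked if count >= min_votes]


# Invented toy selections, not results from a real dataset
selections = {
    "variance": ["f1", "f2", "f3", "f4"],
    "correlation": ["f1", "f3", "f4"],
    "statistical": ["f1", "f3", "f5"],
    "linear": ["f1", "f2", "f3"],
    "tree_based": ["f1", "f3", "f6"],
}

# Features that at least 3 of the 5 selectors agree on
stable = vote_features(selections, min_votes=3)
print("Stable features:", stable)
```

Raising `min_votes` makes the selection stricter. Running the same vote inside each cross-validation fold and comparing the resulting lists is one way to check that the selection is stable.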

# Unstructured Data via TextWiser

* Unstructured data, such as text, audio, and video, is another common data source.
* In recommenders, the typical usage of unstructured data is to represent **item features**.
* Unstructured data should first be featurized before consumption in recommenders.  
* For text data, you can leverage [TextWiser](https://github.com/fidelity/textwiser) to create text embeddings of item representations.
* TextWiser ([AAAI'21](https://ojs.aaai.org/index.php/AAAI/article/view/17814)) provides an easy-to-use API for a rich set of text featurization methods and their transformation while taking advantage of state-of-the-art pretrained NLP models.
* Let's explore a quick start example. 

In [4]:
# Conceptually, TextWiser is composed of an Embedding, potentially with a pretrained model,
# that can be chained into zero or more Transformations
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions
from transformers import logging
logging.set_verbosity_error()

# Data
documents = ["Some document", "More documents. Including multi-sentence documents."]

# Model: TFIDF `min_df` parameter gets passed to sklearn automatically
emb = TextWiser(Embedding.TfIdf(min_df=1))

# Model: TFIDF followed with an NMF + SVD
emb = TextWiser(Embedding.TfIdf(min_df=1), [Transformation.NMF(n_components=30), Transformation.SVD(n_components=10)])

# Model: Word2Vec with no pretraining that learns from the input data
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained=None), Transformation.Pool(pool_option=PoolOptions.min))

# Model: BERT with the pretrained bert embedding
emb = TextWiser(Embedding.Word(word_option=WordOptions.bert), Transformation.Pool(pool_option=PoolOptions.first))

# Features (computed with the last model assigned above, i.e., BERT)
vecs = emb.fit_transform(documents)

print(documents)
print(vecs)

['Some document', 'More documents. Including multi-sentence documents.']
[[-0.19813849  0.25826398 -0.16164416 ... -0.18344447 -0.04831381
   0.27247164]
 [-0.38283885 -0.03924329 -0.10620081 ... -0.25401732  0.21510349
   0.4452555 ]]


* In this example, we show different embeddings from simple `TFIDF` to more complex `BERT`. 
* Notice how the `Embedding` can be followed by one or more transformation operations, such as `NMF` or `SVD`.
* In general, the `Transformation` reduces the dimensionality of the text representation to create succinct embeddings.
* Running `fit_transform` on the documents returns the embedding of each document.
* Check out different word options, pre-trained models, and other transformations.
* See the rich list of [TextWiser Embeddings](https://github.com/fidelity/textwiser#available-embeddings).
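
The word-level models above are chained with a pooling transformation so that each document ends up with a single vector, no matter how many words it contains. The following is a conceptual sketch of that step in plain Python. It is not TextWiser's implementation: the `pool` helper and the toy word vectors are invented for illustration, and the string options only loosely mirror names such as `PoolOptions.first` and `PoolOptions.min` used in the cell above.

```python
def pool(word_vectors, option):
    """Collapse a list of equal-length word vectors into one document vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    if option == "first":
        # Keep only the vector of the first token
        return list(word_vectors[0])
    if option == "min":
        # Element-wise minimum across all tokens
        return [min(dim) for dim in zip(*word_vectors)]
    if option == "max":
        # Element-wise maximum across all tokens
        return [max(dim) for dim in zip(*word_vectors)]
    if option == "mean":
        # Element-wise average across all tokens
        return [sum(dim) / len(word_vectors) for dim in zip(*word_vectors)]
    raise ValueError(f"unknown pooling option: {option}")


# Invented 3-dimensional vectors for a two-word document
doc = [[0.5, -1.0, 2.0],
       [1.5, 0.0, -2.0]]

first_vec = pool(doc, "first")
min_vec = pool(doc, "min")
mean_vec = pool(doc, "mean")
print(first_vec, min_vec, mean_vec)
```

Each pooled vector has the same length as a single word vector, which is why documents of different lengths can be stacked into one feature matrix afterwards.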

# Sequential Data via Seq2Pat

* The time dimension is another source of information for building advanced recommenders.
* In recommenders, the typical usage of sequential data is to capture the behaviour of a user over time as part of **user features**. 
* For sequential data, you can leverage [Seq2Pat](https://github.com/fidelity/seq2pat).
* Seq2Pat ([AAAI'22](https://aaai.org/Conferences/AAAI-22/)) provides an easy-to-use API for frequent pattern mining in sequential datasets. 
* First, we find frequent patterns. Then, each user is transformed into a one-hot vector denoting the existence of frequent patterns in their sequential behaviour. This representation can be used as their user features.
* Let's explore a quick start example. 

In [5]:
# Example to show how to find frequent sequential patterns
# from a given sequence database subject to constraints
# and generate one-hot encodings as the features of each user
from sequential.seq2pat import Seq2Pat, Attribute
from sequential.dpm import get_one_hot_encodings

# Sequences and attributes data
sequences = [["A", "A", "B", "A", "D"],
             ["C", "B", "A"],
             ["C", "A", "C", "D"]]

values = [[5, 5, 3, 8, 2],
          [1, 3, 3],
          [4, 5, 2, 1]]

# Seq2Pat over 3 sequences
seq2pat = Seq2Pat(sequences=sequences)

# Price attribute corresponding to each item
price = Attribute(values=values)

# Average price constraint
seq2pat.add_constraint(3 <= price.average() <= 4)

# Patterns that occur at least twice (A-D)
patterns = seq2pat.get_patterns(min_frequency=2)
print("Frequent Patterns: ", patterns)

# Encodings that are generated using the mined pattern (A-D)
encodings = get_one_hot_encodings(sequences, patterns)
print("Encodings:\n", encodings)

Frequent Patterns:  [['A', 'D', 2]]
Encodings:
           sequence  feature_0
0  [A, A, B, A, D]          1
1        [C, B, A]          0
2     [C, A, C, D]          1


* In this example, we have 3 users with certain sequential events, e.g., page visits in order.
* Notice how the length of each sequence in the sequence database is different.
* We can mine for frequent patterns in this sequence database while setting a `min_frequency` threshold to denote the minimum number of occurrences of a pattern to be considered frequent.
* More importantly, we consider **attributes** that correspond to each sequential event. For example, the price of the item in each page visit. 
* Then, we add **constraints** to reason about attributes, here, the average price to be between 3 and 4. 
* Pattern mining operates on the sequence database and seeks frequent patterns while satisfying constraints and the minimum frequency threshold. 
* See [Seq2Pat Constraints](https://github.com/fidelity/seq2pat/blob/master/notebooks/usage_example.ipynb).
* Finally, we create one-hot encodings of each user with the mined pattern (A-D), indicating the existence of such a pattern in a sequence.
* See [Dichotomic Pattern Mining](https://github.com/fidelity/seq2pat/blob/master/notebooks/dichotomic_pattern_mining.ipynb) (DPM) as an integrator technology between Sequential Pattern Mining and downstream modeling tasks via Seq2Pat.
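
To make the semantics of the example concrete, the check behind the `feature_0` column can be reproduced with a brute-force sketch in plain Python. This is not how Seq2Pat works internally, and the `occurs` helper is hypothetical. It also assumes that a pattern counts for a sequence as soon as at least one of its occurrences satisfies the average price constraint.

```python
from itertools import combinations


def occurs(pattern, sequence, values, low, high):
    """Return True if `pattern` appears in order in `sequence` with at least
    one occurrence whose average attribute value lies in [low, high].

    Assumption: a single satisfying occurrence is enough.
    """
    k = len(pattern)
    # combinations yields increasing index tuples, so event order is preserved
    for idx in combinations(range(len(sequence)), k):
        if all(sequence[i] == item for i, item in zip(idx, pattern)):
            average = sum(values[i] for i in idx) / k
            if low <= average <= high:
                return True
    return False


# Same sequences and price values as in the cell above
sequences = [["A", "A", "B", "A", "D"],
             ["C", "B", "A"],
             ["C", "A", "C", "D"]]
values = [[5, 5, 3, 8, 2],
          [1, 3, 3],
          [4, 5, 2, 1]]

pattern = ["A", "D"]
encoding = [int(occurs(pattern, s, v, low=3, high=4))
            for s, v in zip(sequences, values)]
frequency = sum(encoding)
print("Encoding:", encoding, "Frequency:", frequency)
```

The first sequence qualifies through the occurrence with prices 5 and 2 (average 3.5), and the third through prices 5 and 1 (average 3), so the pattern is found in two sequences. Enumerating all index combinations is only practical for toy inputs like this one.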