# Feature-Rich Recommender Systems

Interaction data is the most basic indication of users' preferences and interests, and it plays a critical role in the models introduced previously. Yet interaction data is usually extremely sparse and can be noisy at times. To address this issue, we can integrate side information, such as features of items, profiles of users, and even the context in which the interaction occurred, into the recommendation model. Utilizing these features is helpful because they can be effective predictors of users' interests, especially when interaction data is lacking. As such, it is essential for recommendation models to be able to deal with those features, which gives the model some content/context awareness. To demonstrate this type of recommendation model, we introduce another task, click-through rate (CTR) prediction for online advertisement recommendations :cite:`McMahan.Holt.Sculley.ea.2013`, and present an anonymized advertising dataset. Targeted advertisement services have attracted widespread attention and are often framed as recommendation engines. Recommending advertisements that match users' personal tastes and interests is important for improving the click-through rate.


Digital marketers use online advertising to display advertisements to customers. Click-through rate is a metric that measures the number of clicks advertisers receive on their ads per number of impressions. It is expressed as a percentage and calculated with the formula:

$$ \text{CTR} = \frac{\#\text{Clicks}} {\#\text{Impressions}} \times 100 \% .$$
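
As a quick sanity check of the formula, the following sketch computes CTR from raw counts. The class name `CtrExample`, the method name `ctrPercent`, and the counts are illustrative assumptions; they are not part of the dataset or of DJL.

```java
// Minimal sketch: compute click-through rate as a percentage.
final class CtrExample {

    /** Returns clicks / impressions * 100, or 0 when there were no impressions. */
    static double ctrPercent(long clicks, long impressions) {
        if (impressions <= 0) {
            // Assumption: report 0% instead of dividing by zero.
            return 0.0;
        }
        return 100.0 * clicks / impressions;
    }

    public static void main(String[] args) {
        // Hypothetical campaign: 25 clicks out of 1,000 impressions.
        System.out.println(ctrPercent(25, 1000) + "%"); // prints 2.5%
    }
}
```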

Click-through rate is an important signal that indicates the effectiveness of prediction algorithms. Click-through rate prediction is the task of predicting the likelihood that something on a website will be clicked. CTR prediction models can be employed not only in targeted advertising systems but also in general item (e.g., movies, news, products) recommender systems, email campaigns, and even search engines. CTR is also closely related to user satisfaction and conversion rate, and it can help advertisers set realistic expectations when defining campaign goals.


In [None]:
%maven ai.djl:api:0.8.0
%maven ai.djl:basicdataset:0.8.0
%maven ai.djl:model-zoo:0.8.0
%maven ai.djl.mxnet:mxnet-engine:0.8.0
%maven org.slf4j:slf4j-api:1.7.26
%maven org.slf4j:slf4j-simple:1.7.26
%maven net.java.dev.jna:jna:5.3.0

In [None]:
%maven ai.djl.mxnet:mxnet-native-auto:1.7.0-backport

## An Online Advertising Dataset

With the considerable advancements of Internet and mobile technology, online advertising has become an important source of income and generates the vast majority of revenue in the Internet industry. It is important to display relevant advertisements, or advertisements that pique users' interests, so that casual visitors can be converted into paying customers. The dataset we introduce is an online advertising dataset. It consists of 34 feature fields preceded by a target column that indicates whether an ad was clicked (1) or not (0). All the feature columns are categorical. They might represent the advertisement id, site or application id, device id, time, user profiles, and so on. The real semantics of the features are undisclosed due to anonymization and privacy concerns.
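
To make the record layout concrete, here is a minimal sketch that splits one line into its label and its feature strings. It assumes tab-separated fields, as the loader later in this section does. The class name `RecordLayout` and the three-feature sample line are made up for illustration; real records carry 34 features.

```java
import java.util.Arrays;

final class RecordLayout {

    /** Label of a record: the first tab-separated field (1 = clicked, 0 = not clicked). */
    static float label(String line) {
        return Float.parseFloat(line.trim().split("\t")[0]);
    }

    /** Categorical feature strings of a record: every field after the label. */
    static String[] features(String line) {
        String[] fields = line.trim().split("\t");
        return Arrays.copyOfRange(fields, 1, fields.length);
    }

    public static void main(String[] args) {
        String line = "1\ta3f9\t07c2\tb81e"; // hypothetical anonymized record
        System.out.println(label(line));
        System.out.println(Arrays.toString(features(line)));
    }
}
```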

The following code downloads the dataset from our server and extracts it under the current working directory.


In [None]:
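import java.io.InputStream;
import java.net.*;

import ai.djl.training.util.*;
import ai.djl.util.*;
import java.nio.file.*;


// Download the archive and extract it into the current directory;
// try-with-resources closes the network stream once extraction finishes.
try (InputStream input = new URL("http://d2l-data.s3-accelerate.amazonaws.com/ctr.zip").openStream()) {
    ZipUtils.unzip(input, Paths.get("./"));
}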

The archive contains a training set and a test set, consisting of 15000 and 3000 samples/lines, respectively.

## Dataset Wrapper

For the convenience of data loading, we implement a `CTRDataset` that loads the advertising dataset from the CSV file. It extends DJL's `ArrayDataset`, so it can be batched and iterated like other DJL datasets.


In [None]:
import ai.djl.engine.Engine;
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.ndarray.NDManager;
import ai.djl.training.dataset.ArrayDataset;
import ai.djl.training.dataset.Record;
import ai.djl.util.Progress;
import com.google.gson.Gson;

import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CTRDataset extends ArrayDataset {

    private boolean prepared;
    private NDManager manager = Engine.getInstance().newBaseManager();
    private List<Long[]> oneHotFeatures;
    private List<Float> labelList;

    private CTRDataset(Builder builder) {
        super(builder);
        this.oneHotFeatures = builder.oneHotFeatures;
        this.labelList = builder.label;
    }

    @Override
    public void prepare(Progress progress) throws IOException {
        if (prepared) {
            return;
        }
        data = new NDArray[oneHotFeatures.size()];
        labels = new NDArray[labelList.size()];
        for (int i = 0; i < data.length; i++) {
            data[i] = manager.create(Arrays.stream(oneHotFeatures.get(i)).mapToLong(Long::longValue).toArray());
            labels[i] = manager.create(labelList.get(i));
        }
        prepared = true;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public Record get(NDManager manager, long index) {
        NDList datum = new NDList();
        NDList label = new NDList();

        datum.add(data[(int) index]);
        if (labels != null) {
            label.add(labels[(int) index]);
        }
        datum.attach(manager);
        label.attach(manager);
        return new Record(datum, label);
    }


    public static Builder builder() {
        return new Builder();
    }

    public static final class Builder extends BaseBuilder<Builder> {

        private long numFeatures;
        private long featureThreshold;
        private String fileName;
        // feature id, category String, category code
        private Map<Long, Map<String, Long>> featureMap = new ConcurrentHashMap<>();
        // feature id, category String, category count
        private Map<Long, Map<String, Long>> featureCount = new ConcurrentHashMap<>();
        private Map<Long, Long> defaultValues = new ConcurrentHashMap<>();
        private List<String[]> features = new ArrayList<>();
        private List<Float> label = new ArrayList<>();
        private Long[] fieldDim;
        private Long[] offset;
        private List<Long[]> oneHotFeatures = new ArrayList<>();
        private String outputDir;

        Builder() {
        }

        @Override
        protected Builder self() {
            return this;
        }

        public Builder setFileName(String fileName) {
            this.fileName = fileName;
            return this;
        }

        public Builder optNumFeatures(long numFeatures) {
            this.numFeatures = numFeatures;
            return this;
        }

        public Builder optFeatureThreshold(long featureThreshold) {
            this.featureThreshold = featureThreshold;
            return this;
        }

        public Builder optMapOutputDir(String outputDir) {
            this.outputDir = outputDir;
            return this;
        }

        CTRDataset build() throws IOException {

            try (BufferedReader reader = Files.newBufferedReader(Paths.get(this.fileName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] record = line.trim().split("\t");
                    if (record.length != this.numFeatures + 1) {
                        continue;
                    }
                    label.add(Float.parseFloat(record[0]));
                    for (int i = 1; i < numFeatures + 1; i++) {
                        Map<String, Long> count = featureCount.computeIfAbsent((long) i, k -> new ConcurrentHashMap<>());
                        // increment count for this category string
                        count.merge(record[i], 1L, Long::sum);
                    }
                    features.add(Arrays.copyOfRange(record, 1, record.length));
                }
            }
            fieldDim = new Long[(int) numFeatures];
            offset = new Long[(int) numFeatures];
            // reduce less frequent class
            for (long i = 1L; i < numFeatures + 1; i++) {
                featureCount.get(i).values().removeIf(value -> value < featureThreshold);
                Map<String, Long> reducedFeatures = featureCount.get(i);
                Map<String, Long> featureIndex = new ConcurrentHashMap<>();
                long index = 0;
                for (String feature : reducedFeatures.keySet()) {
                    featureIndex.put(feature, index);
                    index++;
                }
                featureMap.put(i, featureIndex);
                defaultValues.put(i, (long) featureIndex.size());
                // each field spans its kept categories plus one default slot for rare or unseen values
                fieldDim[(int) i - 1] = (long) featureIndex.size() + 1;
            }
            long sum = 0;
            for (int i = 0; i < fieldDim.length; i++) {
                offset[i] = sum;
                sum += fieldDim[i];
            }

            for (String[] feature : features) {
                Long[] oneHot = new Long[feature.length];
                for (int i = 0; i < oneHot.length; i++) {
                    oneHot[i] = featureMap.get((long) i + 1).getOrDefault(feature[i], defaultValues.get((long) i + 1)) + offset[i];
                }
                oneHotFeatures.add(oneHot);
            }
            // save feature map and default values for inference
            if (outputDir != null) {
                saveMap(featureMap, outputDir, "feature_map.json");
                saveMap(defaultValues, outputDir, "defaults.json");
            }

            return new CTRDataset(this);
        }

        private void saveMap(Map<?, ?> map, String outputDir, String fileName) throws IOException {
            Gson gson = new Gson();
            // try-with-resources flushes and closes the writer even if serialization fails
            try (FileWriter writer = new FileWriter(outputDir + "/" + fileName)) {
                gson.toJson(map, writer);
            }
        }

    }
}
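
The encoding idea behind `build()` can be hard to follow inside the full class, so the sketch below illustrates it on a toy dataset with two fields, using only plain Java collections. Categories seen fewer than `threshold` times share one default index per field, and per-field offsets shift each field's indices so that different fields never collide. The class `ToyEncoder`, its method names, and the sample values are hypothetical. Indices are assigned in sorted key order here to keep the result deterministic, which is an assumption of this sketch and not something `CTRDataset` guarantees.

```java
import java.util.*;

final class ToyEncoder {
    final List<Map<String, Long>> vocab = new ArrayList<>(); // per field: category -> local index
    final long[] offsets;                                    // per field: start of its index range

    ToyEncoder(String[][] rows, int threshold) {
        int numFields = rows[0].length;
        offsets = new long[numFields];
        long sum = 0;
        for (int f = 0; f < numFields; f++) {
            // Count how often each category occurs in field f (TreeMap keeps keys sorted).
            Map<String, Long> counts = new TreeMap<>();
            for (String[] row : rows) {
                counts.merge(row[f], 1L, Long::sum);
            }
            // Keep only frequent categories and number them 0, 1, 2, ...
            Map<String, Long> index = new HashMap<>();
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() >= threshold) {
                    index.put(e.getKey(), (long) index.size());
                }
            }
            vocab.add(index);
            offsets[f] = sum;
            sum += index.size() + 1; // +1 reserves the default slot for rare or unseen values
        }
    }

    /** Maps each category string to its global index (local index plus field offset). */
    long[] encode(String[] row) {
        long[] out = new long[row.length];
        for (int f = 0; f < row.length; f++) {
            Map<String, Long> index = vocab.get(f);
            long local = index.getOrDefault(row[f], (long) index.size()); // default = last slot
            out[f] = local + offsets[f];
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"phone", "news"}, {"phone", "news"}, {"tablet", "shop"}, {"phone", "shop"}
        };
        ToyEncoder enc = new ToyEncoder(rows, 2);
        System.out.println(Arrays.toString(enc.encode(new String[] {"phone", "shop"})));  // [0, 3]
        System.out.println(Arrays.toString(enc.encode(new String[] {"tablet", "blog"}))); // [1, 4]
    }
}
```

In this toy run, `tablet` occurs only once and `blog` never occurs, so both fall back to the default slot of their field (index 1 for the first field, index 4 for the second).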

The following example loads the training data and prints out the first record. We also save the feature map and default values, which are needed for inference.


In [None]:
CTRDataset data = CTRDataset.builder()
                .optFeatureThreshold(4)
                .optNumFeatures(34)
                .setFileName("./ctr/train.csv")
                .optMapOutputDir("./")
                .setSampling(1, true)
                .build();
data.prepare();
NDManager manager = NDManager.newBaseManager();
Record record = data.get(manager, 0);
System.out.println(record.getData().singletonOrThrow());
System.out.println(record.getLabels().singletonOrThrow());

As can be seen, all 34 fields are categorical features. Each value represents the one-hot index of the corresponding entry. The label $0$ means that the ad was not clicked. This `CTRDataset` can also be used to load other datasets such as the Criteo display advertising challenge [dataset](https://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/) and the Avazu click-through rate prediction [dataset](https://www.kaggle.com/c/avazu-ctr-prediction).
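
To see what "one-hot index" means here, the sketch below expands an index vector into the sparse 0/1 vector it stands for: position `k` is set to 1 for every index `k` in the record. The class `MultiHot`, the total dimension of 5, and the sample indices are illustrative assumptions. Embedding-based models typically consume the indices directly and never materialize this vector.

```java
import java.util.Arrays;

final class MultiHot {

    /** Expands per-field indices into a dense 0/1 vector of length totalDim. */
    static float[] expand(long[] indices, int totalDim) {
        float[] dense = new float[totalDim];
        for (long idx : indices) {
            dense[(int) idx] = 1f; // one active position per field
        }
        return dense;
    }

    public static void main(String[] args) {
        // Hypothetical record with two fields whose index ranges are [0, 2) and [2, 5).
        System.out.println(Arrays.toString(expand(new long[] {0, 3}, 5)));
        // prints [1.0, 0.0, 0.0, 1.0, 0.0]
    }
}
```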

## Summary

* Click-through rate is an important metric that is used to measure the effectiveness of advertising systems and recommender systems.
* Click-through rate prediction is usually converted to a binary classification problem. The target is to predict whether an ad/item will be clicked or not based on given features.

## Exercises

* Can you load the Criteo and Avazu datasets with the provided `CTRDataset`? It is worth noting that the Criteo dataset contains real-valued features, so you may have to revise the code a bit.


[Discussions](https://discuss.d2l.ai/t/405)
