# Progress Report


**Group Name**: support nectar machines

**Team members**: David Hofer (Cyber Security), Frederieke Lohmann (Data Science), Arvid Ban (Computer Science), Yi-Yi Ly (Neuroinformatics)

## 1. Introduction

The aim of our hacking project is to create a **recommender system for investors**. This recommender system uses data provided by the UBS Evidence Lab, which tracks the popularity of brands on Instagram. The data collected include, for example, the number of followers, pictures, videos, comments and likes.

Our recommender system personalizes the information learnt from the data by the following criteria of the investor:
1. Popularity prospects based on **current** versus **future predicted** metrics
1. Type of investment desired, i.e. **risk-taking** versus **conservative** investment.

## 2. Data Processing

### Exploratory Data Analysis

The dataset consists of 704'313 rows from **706 brands**, recorded from 2015-01-03 to 2023-09-16. The brands are grouped into 20 main competitive sets that vary in size from 1 brand to 164 brands.

![Figure caption](../yiyi/figs/main_comp_size.png)

### Data cleaning and preprocessing

> Provide a detailed account of the initial steps taken
to prepare the data for analysis. This should include a description of how data quality
issues, such as missing values or outliers, were addressed.

> Assumptions: Clearly articulate any assumptions that were made during the data
preparation phase.

In our preprocessing, we removed a large number of features that we deemed generally uninformative, as well as the two constant features `period` and `calculation_type`. Our primary focus was on modeling social media user interaction, so columns like `legal_entity_name`, `ultimate_parent_legal_entity_name`, and `primary_exchange_name` did not provide any additional insight. Furthermore, there were many duplicate rows that differed only in their `compset` value, which led us to drop that feature as uninformative. Finally, our investigation of the `compset_group` feature found that most groups contain only one or a few brands, which makes the feature unsuitable for comparative analysis between brands within a group, so we left it out as well.
We experimented both with and without `domicile_country_name` (which we had previously cleaned by replacing certain noisy strings), but as it did not lead to a significant improvement of our methods, we opted for simplicity and left it out too.
We also assumed that many of the aforementioned features are strongly correlated with each other and with the brand, leading to diminishing returns from including them.

In the next step, we completely removed brands with too much missing data in any of the five numerical features `followers`, `pictures`, `videos`, `comments`, and `likes`, based on a 70% threshold.
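A minimal sketch of such a filter is given here. It assumes a long-format pandas DataFrame with one row per brand and week, and it assumes that the 70% threshold refers to the fraction of missing values per feature. The helper name `drop_sparse_brands` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

NUMERIC_COLS = ["followers", "pictures", "videos", "comments", "likes"]

def drop_sparse_brands(df, cols=NUMERIC_COLS, max_missing=0.7):
    """Drop every brand whose missing fraction exceeds max_missing in any column.

    Assumption: the threshold is a per-feature fraction of missing rows.
    """
    # Fraction of NaNs per brand and per feature.
    missing = df[cols].isna().groupby(df["brand"]).mean()
    # A brand is kept only if all of its features stay within the threshold.
    keep = missing[(missing <= max_missing).all(axis=1)].index
    return df[df["brand"].isin(keep)]

# Toy data: brand "b" is missing 3 of its 4 `likes` values (75%).
toy = pd.DataFrame({
    "brand": ["a"] * 4 + ["b"] * 4,
    "followers": [1.0, 2.0, 3.0, 4.0] * 2,
    "pictures": [1.0] * 8,
    "videos": [1.0] * 8,
    "comments": [1.0] * 8,
    "likes": [5.0, 6.0, np.nan, 8.0, np.nan, np.nan, np.nan, 9.0],
})
filtered = drop_sparse_brands(toy)
```

In this toy example only brand "a" survives, because brand "b" exceeds the threshold in `likes`.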

We also standardized all numerical features by subtracting the mean and dividing by the standard deviation. We then imputed the remaining missing values in the time series using the forward-fill method, with the intention of not leaking future data. In the few cases where the initial datapoint was missing, we imputed it using the first value occurring in the sequence.
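The two steps can be sketched as follows. This is an illustration under assumptions: the statistics are computed over the whole column (the report does not say whether standardization was global or per brand), rows are already sorted chronologically within each brand, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd

def standardize_and_impute(df, cols):
    """Z-score each column, then fill gaps brand by brand.

    Assumptions: global mean/std per column, rows sorted by time within a brand.
    """
    out = df.copy()
    # Standardize: subtract the mean and divide by the standard deviation.
    out[cols] = (out[cols] - out[cols].mean()) / out[cols].std()
    # Forward-fill within each brand so that no later value fills an earlier gap.
    out[cols] = out.groupby("brand")[cols].ffill()
    # Only leading gaps remain; they take the first observed value of the brand.
    out[cols] = out.groupby("brand")[cols].bfill()
    return out

toy = pd.DataFrame({
    "brand": ["a", "a", "a", "b", "b", "b"],
    "likes": [np.nan, 1.0, np.nan, 3.0, np.nan, 5.0],
})
clean = standardize_and_impute(toy, ["likes"])
```

Grouping by brand before filling keeps values from leaking from one brand's series into the next.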

Lastly, we also created a train-test split for evaluation. For this, we grouped all the time series by brand, and for each brand cut off the last 20% of the datapoints as a test set.
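A chronological per-brand split of this kind could look like the sketch below. The column name `date` and the function name are assumptions made for illustration.

```python
import pandas as pd

def split_per_brand(df, test_frac=0.2, time_col="date"):
    """Hold out the last test_frac of each brand's time series as test data."""
    df = df.sort_values(["brand", time_col])
    # Position of each row within its brand, and the length of that brand's series.
    pos = df.groupby("brand").cumcount()
    size = df.groupby("brand")[time_col].transform("size")
    # Rows at or beyond the cutoff position belong to the test set.
    is_test = pos >= (size * (1 - test_frac)).round().astype(int)
    return df[~is_test], df[is_test]

toy = pd.DataFrame({
    "brand": ["a"] * 10 + ["b"] * 5,
    "date": list(pd.date_range("2023-01-07", periods=10, freq="7D"))
            + list(pd.date_range("2023-01-07", periods=5, freq="7D")),
})
train, test = split_per_brand(toy)
```

Because the cut is made at the end of each series, every test point lies strictly after all training points of the same brand.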


> Feature engineering and data augmentation: Describe any techniques employed to
enhance the dataset, whether through the creation of new features or augmentation
of the existing data.

Our philosophy was to largely focus on the raw features and let our models learn directly from them. However, we experimented with various engineered advanced features, such as rolling averages, exponential moving averages, growth rates, rolling minimum, maximum, and standard deviation, as well as time lag features (older features from previous timesteps shifted forward). In the end, we decided to augment our features with the rolling min, max, and std as well as time lag features.
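As an illustration, the retained features could be computed per brand as in the following sketch. The window size, the lag values, and the function name are assumptions, since the report does not state them.

```python
import pandas as pd

def add_rolling_and_lag_features(df, col, window=4, lags=(1, 2)):
    """Append per-brand rolling min/max/std and lagged copies of `col`.

    Assumes rows are sorted chronologically within each brand; `window`
    and `lags` are illustrative defaults, not values from the report.
    """
    out = df.copy()
    grouped = df.groupby("brand")[col]
    # Each rolling window only looks backwards within a single brand.
    out[f"{col}_roll_min"] = grouped.transform(lambda s: s.rolling(window, min_periods=1).min())
    out[f"{col}_roll_max"] = grouped.transform(lambda s: s.rolling(window, min_periods=1).max())
    out[f"{col}_roll_std"] = grouped.transform(lambda s: s.rolling(window, min_periods=1).std())
    for lag in lags:
        # shift(lag) moves older values forward in time by `lag` steps.
        out[f"{col}_lag_{lag}"] = grouped.shift(lag)
    return out

toy = pd.DataFrame({
    "brand": ["a", "a", "a", "a", "b", "b"],
    "likes": [1.0, 3.0, 2.0, 5.0, 10.0, 20.0],
})
feats = add_rolling_and_lag_features(toy, "likes", window=2, lags=(1,))
```

The first lagged value of every brand is missing by construction, so these rows need to be dropped or imputed before training.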

We also came up with a custom user brand engagement metric, the `engagement_rate_per_post`, which is calculated as $\mathit{erpp} = \frac{\text{likes} + \text{comments}}{\text{followers} \cdot (\text{pictures} + \text{videos})}$. This metric measures the number of weekly interactions with the brand, normalized by the number of posts as well as the size of the user base.
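A direct translation of the formula into pandas is sketched below. Treating rows with a zero denominator (no followers or no posts in a week) as missing is our assumption here; the report does not specify how such rows were handled.

```python
import numpy as np
import pandas as pd

def engagement_rate_per_post(df):
    """Compute (likes + comments) / (followers * (pictures + videos)) per row."""
    posts = df["pictures"] + df["videos"]
    # Assumption: a zero denominator yields NaN instead of an infinite rate.
    denom = (df["followers"] * posts).replace(0, np.nan)
    return (df["likes"] + df["comments"]) / denom

toy = pd.DataFrame({
    "likes": [90.0, 50.0],
    "comments": [10.0, 5.0],
    "followers": [1000.0, 1000.0],
    "pictures": [3.0, 0.0],
    "videos": [2.0, 0.0],
})
erpp = engagement_rate_per_post(toy)
```

For the first toy row, 100 interactions on 5 posts with 1000 followers give a rate of 0.02; the second row has no posts and is therefore left missing.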





## 3. Modeling Approach

We pursued two different approaches for modeling the trends in the social media interaction dataset. 
Both are based on our custom metric, `engagement_rate_per_post`.

### Peak detection

The first approach uses `engagement_rate_per_post` as a surrogate for a brand's performance on social media, and therefore also as an indicator of emerging trends.
We aim to identify spikes in a brand's metric over time, which indicate that the brand is currently the subject of a social media trend and therefore potentially interesting to stakeholders.
Since a single spike might not indicate general trend potential of the brand, we use the peak rate $pr = \frac{\# \text{peaks}}{\# \text{weeks}}$ of a brand to identify outliers among the brands. Only brands whose peak rate lies above the $q$-th percentile of the peak rate distribution are considered anomalies and therefore interesting.

This parametrization via the quantile allows stakeholders to exert tight control over the precise number of outliers they wish to identify. The larger the quantile, the fewer outliers will be retrieved and the more extreme those retrieved will be.

The peaks are extracted using scipy's peak detection algorithm with a threshold at 30% of the maximum `engagement_rate_per_post` of the given brand.

\
\
**Algorithm**

The outlier detection algorithm works as follows:
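```
q <- 98
pr <- {}

for each brand:
    # peaks must reach at least 30% of the brand's maximum engagement rate
    t <- 0.3 * max(engagement_rate_time_series)
    pr[brand] <- calculate_peak_rate(engagement_rate_time_series, t)

# keep only the brands above the q-th percentile of all peak rates
outliers <- brands with pr[brand] > percentile(pr, q)

return outliers
```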
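The same procedure can be sketched in runnable Python. This is an illustration under assumptions: it maps the 30% threshold to the `height` argument of `scipy.signal.find_peaks` (the report does not say which argument was used), it expects one gap-free weekly series per brand, and the function names are hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_rate(series, rel_height=0.3):
    """Peaks per week, counting only peaks of at least rel_height * max."""
    series = np.asarray(series, dtype=float)
    # Assumption: the 30% threshold is a minimum peak height.
    peaks, _ = find_peaks(series, height=rel_height * series.max())
    return len(peaks) / len(series)

def peak_rate_outliers(series_by_brand, q=98):
    """Return the brands whose peak rate lies above the q-th percentile."""
    rates = {brand: peak_rate(s) for brand, s in series_by_brand.items()}
    cutoff = np.percentile(list(rates.values()), q)
    return [brand for brand, rate in rates.items() if rate > cutoff]

# Toy series: no peak, a single peak, and repeated peaks.
toy = {
    "flat": [1, 1, 1, 1, 1, 1, 1, 1],
    "single": [0, 0, 0, 4, 0, 0, 0, 0],
    "spiky": [0, 5, 0, 5, 0, 5, 0, 5, 0],
}
outliers = peak_rate_outliers(toy, q=50)
```

With only three toy brands, `q=50` is used so that the cutoff is the median peak rate; on the full dataset a value such as `q=98` retains only the most extreme brands.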
\
\
**Results**



![](peak_detect_plots.png)


### Anomaly detection on LSTM embedding 

**Anomaly detection**

Once we have an embedding of each brand that already encodes the temporal dimension in a single data point, we can run classical anomaly detection algorithms on the time-independent latent space.

We considered the following methods:
* Isolation Forest
* K-Means clustering
* Local Outlier Factor

Since Local Outlier Factor requires careful tuning of the neighborhood size, we decided against using this method.
For K-Means clustering, the model assumption is that the clusters are spherical. We used the silhouette score to tune the number of clusters and found that it does not provide conclusive results for the optimal number of clusters on the embeddings produced by the LSTM. We assume that this is due to the non-spherical geometric structure, which is also visible when inspecting the latent space.
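The tuning procedure can be sketched with scikit-learn as follows. The candidate range of cluster counts is an assumption, and the data are synthetic Gaussian blobs standing in for the LSTM embeddings, so the clear optimum seen here is not representative of our results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(embeddings, k_values=range(2, 7), seed=0):
    """Fit K-Means for each candidate k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    return scores

# Stand-in data: three well-separated blobs instead of real LSTM embeddings.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_embeddings = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
scores = silhouette_by_k(toy_embeddings)
best_k = max(scores, key=scores.get)
```

On spherical, well-separated clusters the score peaks sharply at the true number of clusters. A flat score curve over all candidates is the kind of inconclusive outcome described above.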

Thus, we rely on Isolation Forest as our anomaly detection method, since it does not make any assumptions about the geometric structure and is rather stable with respect to its hyperparameter, the number of estimators used.
The results of predicting outliers in the latent space of the LSTM can be seen in the following PCA plot of the LSTM embedding. Blue points indicate outliers and red points indicate inliers, as classified by the Isolation Forest.

![](./latent_if.png)
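A plot of this kind can be produced along the lines of the sketch below. The hyperparameter values and the function name are assumptions, and random stand-in data replace the actual LSTM embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

def flag_latent_outliers(embeddings, n_estimators=100, seed=0):
    """Return a boolean outlier mask and a 2-D PCA projection for plotting."""
    forest = IsolationForest(n_estimators=n_estimators, random_state=seed)
    # fit_predict returns -1 for outliers and +1 for inliers.
    is_outlier = forest.fit_predict(embeddings) == -1
    # PCA is used for visualization only; detection runs on the full embedding.
    projection = PCA(n_components=2).fit_transform(embeddings)
    return is_outlier, projection

# Stand-in data: a dense 8-dimensional cloud plus one far-away point.
rng = np.random.default_rng(0)
toy_embeddings = np.vstack([rng.normal(size=(200, 8)), np.full((1, 8), 10.0)])
is_outlier, projection = flag_latent_outliers(toy_embeddings)
```

The mask can then be used to color the two projected coordinates in a scatter plot.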
\
\
\
As of now, we have not been able to conclusively identify which factors of variation the latent space encodes, but we were able to qualitatively verify that they correspond neither to the brand label nor to the time point of the given test subsequence relative to other test subsequences of the same brand.
We conclude this from the seemingly random distribution of these features in the latent space.

![](./brands_latent.png)  ![](./time_latent.png)


We thus hypothesize that these latent embeddings encode some relevant degrees of freedom in the input features.
The outliers found with the isolation forest would then correspond to outliers based on these degrees of freedom.
Further analyses have to be conducted to quantify with respect to what features the identified datapoints can be considered outliers, and whether this is a useful proxy for brand relevance.

### Model enhancement