# Progress Report


**Group Name**: support nectar machines

**Team members**: David Hofer (Cyber Security), Frederieke Lohmann (Data Science), Arvid Ban (Computer Science), Yi-Yi Ly (Neuroinformatics)

## 1. Introduction

The aim of our hacking project is to create a **recommender system for investors**. This recommender system uses the data provided by the UBS Evidence Lab, which tracks the popularity of brands on Instagram. The data collected include, for example, the number of followers, pictures, videos, comments and likes.

Our recommender system personalizes the information learnt from the data by the following criteria of the investor:
1. Popularity prospects based on **current** versus **future predicted** metrics
1. Type of investment desired, i.e. **risk-taking** versus **conservative** investment.

## 2. Data Processing

### Exploratory Data Analysis

The dataset consists of 704'313 rows from **706 brands** that were recorded from 2015-01-03 to 2023-09-16. The brands are grouped into 20 main competitive sets that vary in size from 1 brand to 164 brands.

![Figure caption](../yiyi/figs/main_comp_size.png)

### Data cleaning and preprocessing

> Provide a detailed account of the initial steps taken
to prepare the data for analysis. This should include a description of how data quality
issues, such as missing values or outliers, were addressed.

> Assumptions: Clearly articulate any assumptions that were made during the data
preparation phase.

In our preprocessing, we removed a large number of features that we deemed generally uninformative, as well as the two constant features `period` and `calculation_type`. Our primary focus was on modeling social media user interaction, so columns such as `legal_entity_name`, `ultimate_parent_legal_entity_name`, and `primary_exchange_name` did not provide any additional insight. Furthermore, many rows were duplicates that differed only in their `compset` value, so we dropped that feature as uninformative. Finally, our investigation of the `compset_group` feature showed that most groups contain only one or a few brands, which makes it unsuitable for comparing brands within a group, so we left it out as well.
We experimented both with and without `domicile_country_name` (which we had first cleaned by replacing certain noisy strings), but as it did not lead to a significant improvement of our methods, we opted for simplicity and left it out too.
We also assumed that many of these features are strongly correlated with each other and with the brand, leading to diminishing returns from including them.
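
The column removal and de-duplication described above could look roughly like the following sketch. The column names are the ones mentioned in the text; the toy frame, the `period_end_date` column and the `"Weekly"` value are made up for illustration.

```python
import pandas as pd

# Columns the text names as removed.
DROPPED_COLUMNS = [
    "period", "calculation_type", "legal_entity_name",
    "ultimate_parent_legal_entity_name", "primary_exchange_name",
    "compset", "compset_group", "domicile_country_name",
]

def drop_uninformative(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the listed columns, then collapse rows that became identical."""
    kept = df.drop(columns=[c for c in DROPPED_COLUMNS if c in df.columns])
    return kept.drop_duplicates().reset_index(drop=True)

# Toy data (hypothetical): brand A appears twice, differing only in `compset`.
toy = pd.DataFrame({
    "brand": ["A", "A", "B"],
    "period_end_date": ["2020-01-04", "2020-01-04", "2020-01-04"],
    "compset": ["Apparel", "Luxury", "Apparel"],
    "period": ["Weekly"] * 3,
    "followers": [100.0, 100.0, 50.0],
})
cleaned = drop_uninformative(toy)
```

Dropping `compset` before calling `drop_duplicates` is what makes the two rows for brand A collapse into one.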

In the next step, we completely removed brands with too much missing data in any of the five numerical features `followers`, `pictures`, `videos`, `comments`, and `likes`, based on a 70% threshold.
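
A minimal sketch of such a filter is shown below. It assumes that the threshold applies to the share of missing values per brand and feature, and that a brand is dropped as soon as one feature exceeds it; the function name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

NUMERIC_FEATURES = ["followers", "pictures", "videos", "comments", "likes"]

def drop_sparse_brands(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Remove brands whose missing share exceeds `threshold` in any feature."""
    # Share of missing values per brand (rows) and feature (columns).
    missing_share = df[NUMERIC_FEATURES].isna().groupby(df["brand"]).mean()
    # Assumption: one feature above the threshold is enough to drop the brand.
    too_sparse = missing_share.gt(threshold).any(axis=1)
    keep = too_sparse[~too_sparse].index
    return df[df["brand"].isin(keep)]

# Toy data (hypothetical): brand B never reports likes.
toy = pd.DataFrame({
    "brand": ["A"] * 4 + ["B"] * 4,
    "followers": [1.0] * 8,
    "pictures": [1.0] * 8,
    "videos": [1.0] * 8,
    "comments": [1.0] * 8,
    "likes": [1.0, 2.0, np.nan, 4.0] + [np.nan] * 4,
})
kept = drop_sparse_brands(toy)
```

Brand A keeps its single gap (25% missing in `likes`) for later imputation, while brand B is removed entirely.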

We also standardized all numerical features by subtracting the mean and dividing by the standard deviation. We then imputed the remaining missing values in the time series using the forward-fill method, with the intention of not leaking future data. In the few cases where the initial datapoint was missing, we imputed it using the first value occurring in the sequence.
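
The standardization and imputation steps can be sketched as follows. Whether the mean and standard deviation are computed over the whole dataset or per brand is not stated here; the sketch assumes global statistics, and the `date` column name and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

FEATURES = ["followers", "likes"]  # subset of the numerical features, for brevity

def standardize_and_impute(df: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    out = df.sort_values(["brand", "date"]).copy()
    # Z-score each feature; mean() and std() skip missing values.
    out[features] = (out[features] - out[features].mean()) / out[features].std()
    # Forward-fill within each brand so that no later value is used.
    out[features] = out.groupby("brand")[features].ffill()
    # A leading gap has nothing to carry forward: use the first observed value.
    out[features] = out.groupby("brand")[features].bfill()
    return out

dates = list(pd.date_range("2020-01-04", periods=4, freq="7D"))
toy = pd.DataFrame({
    "brand": ["A"] * 4 + ["B"] * 4,
    "date": dates * 2,
    "followers": [np.nan, 10.0, np.nan, 30.0, 5.0, np.nan, np.nan, 15.0],
    "likes": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
imputed = standardize_and_impute(toy, FEATURES)
```

Grouping by brand before filling keeps values from one brand's series from spilling into the next one.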

Lastly, we also created a train-test split for evaluation. For this, we grouped all the time series by brand, and for each brand cut off the last 20% of the datapoints as a test set.
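
A per-brand chronological split along these lines might be written as below. The rounding of the 20% cut (here rounded down) and the `date` column name are assumptions of this sketch.

```python
import pandas as pd

def split_by_brand(df: pd.DataFrame, test_fraction: float = 0.2):
    """Hold out the last `test_fraction` of each brand's time series."""
    df = df.sort_values(["brand", "date"])
    # Position of each row within its brand, and the length of that brand's series.
    position = df.groupby("brand").cumcount()
    size = df.groupby("brand")["brand"].transform("size")
    # Assumption: the number of test rows is rounded down.
    n_test = (size * test_fraction).astype(int)
    is_test = position >= size - n_test
    return df[~is_test], df[is_test]

# Toy data (hypothetical): 10 weeks for brand A, 5 weeks for brand B.
toy = pd.DataFrame({
    "brand": ["A"] * 10 + ["B"] * 5,
    "date": list(pd.date_range("2020-01-04", periods=10, freq="7D"))
            + list(pd.date_range("2020-01-04", periods=5, freq="7D")),
    "likes": list(range(15)),
})
train, test = split_by_brand(toy)
```

Because the cut is made inside each brand's own timeline, every test row lies strictly after all training rows of the same brand.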


> Feature engineering and data augmentation: Describe any techniques employed to
enhance the dataset, whether through the creation of new features or augmentation
of the existing data.

Our philosophy was to focus largely on the raw features and let our models learn directly from them. However, we experimented with various engineered features, such as rolling averages, exponential moving averages, growth rates, rolling minimum, maximum, and standard deviation, as well as time lag features (feature values from previous timesteps shifted forward). In the end, we decided to augment our features with the rolling min, max, and std as well as time lag features.
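
As an illustration, rolling and lag features of this kind can be computed per brand as in the sketch below. The window length, the lag values, the generated column names and the `date` column are assumptions, since none of them are specified here.

```python
import pandas as pd

def add_rolling_and_lag(df: pd.DataFrame, column: str,
                        window: int = 4, lags: tuple = (1, 2)) -> pd.DataFrame:
    out = df.sort_values(["brand", "date"]).copy()
    # Rolling statistics over the current and previous `window - 1` rows of a brand.
    for stat in ("min", "max", "std"):
        out[f"{column}_roll_{stat}_{window}"] = out.groupby("brand")[column].transform(
            lambda s: getattr(s.rolling(window, min_periods=1), stat)()
        )
    # Lag features: the value observed `lag` timesteps earlier for the same brand.
    for lag in lags:
        out[f"{column}_lag_{lag}"] = out.groupby("brand")[column].shift(lag)
    return out

# Toy data (hypothetical).
toy = pd.DataFrame({
    "brand": ["A"] * 5 + ["B"] * 2,
    "date": list(pd.date_range("2020-01-04", periods=5, freq="7D"))
            + list(pd.date_range("2020-01-04", periods=2, freq="7D")),
    "likes": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0],
})
featured = add_rolling_and_lag(toy, "likes", window=3, lags=(1,))
```

Both the rolling windows and the shifts are applied within a brand, so the first rows of brand B do not see values from brand A.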

We also came up with a custom user-brand engagement metric, `engagement_rate_per_post`, which is calculated as $\mathrm{erpp} = \frac{\mathrm{likes} + \mathrm{comments}}{\mathrm{followers} \cdot (\mathrm{pictures} + \mathrm{videos})}$. This metric counts the weekly number of interactions with the brand, normalized by the frequency of posts as well as the size of the user base.
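
The metric can be computed column-wise as in the sketch below. It assumes the formula is applied to the raw (unstandardized) weekly counts, and the handling of weeks without posts or followers (returning a missing value) is an assumption rather than something specified here.

```python
import numpy as np
import pandas as pd

def engagement_rate_per_post(df: pd.DataFrame) -> pd.Series:
    """(likes + comments) / (followers * (pictures + videos)), row by row."""
    posts = df["pictures"] + df["videos"]
    denominator = df["followers"] * posts
    # Assumption: a zero denominator yields NaN instead of a division by zero.
    denominator = denominator.where(denominator > 0)
    return (df["likes"] + df["comments"]) / denominator

# Toy data (hypothetical): the second week has no posts.
toy = pd.DataFrame({
    "followers": [10_000.0, 10_000.0],
    "pictures": [4.0, 0.0],
    "videos": [1.0, 0.0],
    "likes": [900.0, 50.0],
    "comments": [100.0, 5.0],
})
toy["engagement_rate_per_post"] = engagement_rate_per_post(toy)
```

In the first toy row, 1000 interactions on 5 posts with 10'000 followers give a rate of 1000 / 50'000 = 0.02.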





## 3. Modeling Approach

We pursued two different approaches for modeling the trends in the social media interaction dataset. The first one

### Model enhancement