Cascle's feed algorithm has tailored from 2 regimes i.e. App and Analytics which is undergo by some parts of data science processes. The database coloring represents location such as blue color stands for app-db
, while green color represents analytics-db
. This file will explain only Analytics regime with mild touch the App. Features are extracted in Feature Extractors i.e. content_stats_update and creator_stats_update to facilitating Model Training and Model Prediction of such coldstart_trainer, personalize_content_trainer and coldstart_predictor, per_ct_predictor, respectively.
Moreover, Aggregators are respond to logically filter contents from Inventory in different ways. Ranker consumes features from Feature Extractors , Trained model artifact from Model Training and aggregated/filtered contents from Aggregators to ranking/scoring tend of engagement for the user using Model Prediction then feeds a certain amount to user's UI. More precisely, user interaction or engagement will be recorded to database for purpose of further Analytics and model evaluation.
Data science workflow process of Castcle can be illustrated by the bottom diagram of the below figure exhibits overall workflow process interacts across databases i.e. blue blocks represent collections in app-db
and green blocks represents collections in analytics-db
databases in Mongodb Atlas, respectively. The bold arrows stand for presence of entity key relation between collections, the dot arrows reflect data extraction by either aggregation or calculation, and the two-headed arrows represents swap event.
In the other hands, the workflow process can be separated in to 3 steps which are shown as the top orange blocks in below diagram (orange arrows) corresponding in vertical axis to the bottom diagram. However, non-machine learning (ML)-involved processes will only carry 2 steps (purple arrow).
However, steps 1. and 2. are explained in castcle-trigger, are located in another repository i.e. castcle-trigger. Hence, this will explain only this repository on step 3.
This step possesses response for both schedule and on demand execution available for guest and registered users, respectively. There are also 2 processes in this step which are coldstart for guest users and personalize content for registered users. Another function to saving persistent is topic classify which is described in castcle-trigger,
- coldstart predictor consumes features from
analytics-db.contentStats
andanalytics-db.creatorStats
and model artifacts fromanalytics-db.mlArtifacts_country
to making model prediction every cron(*/30 * * * ? *). The output i.e. top 2000 content IDs and their scores for each country will be inserted intoapp-db.guestfeeditemtemp
. Whenever insertion finishs, the whole documents inapp-db.guestfeeditemtemp
will be swapped withapp-db.guestfeeditem
to support eliminate downtime prior to usage as feed items for guest users. - personalize content predictor takes request message of account ID and list of content IDs as input then consider to loading presence of the correspond artifact in
analytics-db.mlArtifacts
, if the account artifact is absent, the model will employ correspond geolocation artifact inanalytics-db.mlArtifacts_country
instead. In contrast, if the geolocation artifact is also absent, the model will employ default geolocation artifact which is "us". Therefore, the model also take features from bothanalytics-db.contentStats
andanalytics-db.creatorStats
then perform prediction resulting in scores for each input contents and sent back the output as message with status code "200"
In this section we will describe only collections those are interacted as output from data science processes. The data science-related collections can be separated into 2 groups depend on its location.
- collections in
analytics-db
contentStats
: prepares feature for model utilization in part of content and is obtained by extracting data fromapp-db.contents
. It updates itself every hour by removing contents which have age over threshold and upsert outputs from update content stats.creatorStats
: prepares feature for model utilization in part of content creator user and is obtained by extracting data fromapp-db.contents
. Similar tocontentStats
, it updates itself every hour but keep all content creator user using outputs from update creator stats.mlArtifacts
: collects personalize content model artifacts which are the output from personalize content trainer and also collect model version.mlArtifacts_country
: collects country-based model artifacts which are the output from coldstart trainer and also collect model version.topics
: master collection of topics that is gathered hiearachically; contain children and parents topics (if have) from outputs of topics classify. It updates whenapp-db.contents
has created or updated and topics can be classified.
- collections in
app-db
contentinfo
: duplicated collection amount withapp-db.contents
and updates whenapp-db.contents
has created or updated in the same process from topics classify. It collects both language and topics of the contents.guestfeeditemstemp
: It is output from coldstart predictor designed for eliminate downtime which will swap withguestfeeditems
after successfully when upserted and contains item type.guestfeeditems
: a collection that exists for utilize as feed to guest users which is swap fromguestfeeditemstemp
. It is output from coldstart predictor and also contains item type.
There are 2 repositories those involved with data science process i.e. castcle-trigger and castcle-ds-predict which have similar structure but different functionality. More precisely, one is response for part 1. and 2., while another is response for part 3. of workflow process, respectively.
In this section we will describe only collections those are interacted as output from data science processes which are located in this repository. For another data science-related collections click here.
- requirements.txt: contains necessary libraries.
- serverless.yml: contains configuration.
- python caller files (.py): responses for calling main function in modules,
- coldstart_predictor.py: responses for calling to execute coldstart_predictor.py to update
app-db.guestfeeditemstemp
andapp-db.guestfeeditems
. - per_ct_predictor.py: responses for calling to execute personalize_content_predictor.py. In contrast to
coldstart_predictor.py
, the output is returned as message response.
- modules python files (.py): contain main function files which are located inside sub-folders for operate with correspond python caller file.
This model will be used to ranking/scoring within threshold contents based on engagement behavior in each specified country. The model will be re-trained everyday then stored in analytics-db.mlArtifacts_country
collection. These models support users those do not have their own personalized model and can be used to give a wider range of content recommendation.
- Model inputs
- Country engagement
- Content features
- Country code (250 countries, ISO3166)
- Model detail
- Model: XGBOOST Regression Model
- Target variable: Weight engagement
- Time using: 1 mins (12/13/2021)
- Countries that do not have training data will adopt "us" model
- Output
- Collection contains "countryCode", model artifacts, and timestamp
-
Model workflow This file explain only model prediction section a.k.a. scoring section. If you would like to see another section, click here 4.1. Content features preparation
- Engagement List
- Like
- Comment
- Quote
- Recast
-
Aggregation : Sum
-
Group By: "countryCode" (ISO3166), "contentId"
4.2. Load model artifacts based on country
4.3 Save top 2000 contents Output list
- "content"
- "score"
- "countryCode"
- "type"
- "updatedAt"
- "createdAt"
This model will be used to rank requested contents based on user’s engagement behaviors. The model will be re-trained everyday in the morning and stored in the mlArtifact collection in db_analytics. The model is for users that have their own personalized model meaning that they have at least one engagement history and can be used to give a wider range of content recommendation combined with cold start model.
This model will be used to ranking/scoring the requested contents based on user’s engagement behaviors. The model will be re-trained everyday then stored in analytics-db.mlArtifacts
. These models support users that have their own personalized model meaning that they have at least one engagement history and can be used to give a wider range of content recommendation combined with cold start model.
- Model inputs
- User engagement
- Content features
- Model detail
- Model Used: XGBOOST Regression Model
- Target variable : Weight engagement
- Time using : 1 mins (12/13/2021)
- Output
- Collection contains "userId", model artifacts, and timestamp
- Model workflow (Scoring Section)
4.1 Content features preparation
1. Content Feature List
- likeCount : Total like of each content based on subject
- commentCount : Total comment of each content based on subject
- recastCount : Total recast of each content based on subject
- quoteCount : Total quote of each content based on subject
- photoCount : Total photo of each content
- characterLength : Number of charecter
- creatorContentCount : Total content of the creator of this content
- creatorLikedCount : Total like of the creator of this content
- creatorCommentedCount : Total comment of the creator of this content
- creatorRecastedCount : Total recast of the creator of this content
- creatorQuotedCount : Total quote of the creator of this content
- ageScore : age score of this content
2. Aggregation : Sum, Count
4.2.Load ML artifacts based on userId
4.3.Scoring given contents based on user’s model: Send content score to the Castcle app.