In [1]:
!pip install sentence-transformers -q --root-user-action=ignore

import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None

# sentence embeddings + cosine similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# 1. The data

#### Load relevant data files

In [2]:
# specify path
path = ""

# load in data files
activity = pd.read_excel(path + 'activity.xlsx', usecols=['id', 'name', 'description', 'status']).rename(columns = {'id' : 'activity_id'})
activity_participant = pd.read_csv(path + 'activity_participant.csv', delimiter = ';')
user = pd.read_excel(path + 'user.xlsx', usecols=['id', 'status'])

#### Prepare the data

From the `user` data frame, we only need to know which user_id's are actually active.

In [3]:
# rename id
user = user.rename(columns = {'id' : 'user_id'})

# only active accounts
user_df = user[user.status == 'NORMAL']

# get list of active user_ids 
active_users = user_df['user_id'].values

# 2. The analysis

## Sentence embeddings

For a machine to "understand" text, the text must first be translated into something the machine can process, i.e., `numbers`. Straightforward techniques such as plain `word counts` ignore the context in which words are used. They work well for some use cases but fail to capture the semantic meaning of language, which can lead to imprecise results or even misinterpretation (consider, e.g., a preceding negation word). `Embeddings` instead use information from the words surrounding a word and thereby also capture its context. At the word or sentence level, embeddings provide a `vector representation` (basically a list of numbers) of the semantics and syntax of that word or sentence.

Fortunately, several pre-trained models are freely available and allow anyone to create word/phrase embeddings from their own text input. These models are trained on large text corpora and thus learn which words have a similar meaning or are used in similar contexts. Such words are mapped closer to each other, i.e., they receive more similar vectors.

Here, using a pre-trained multilingual model, each activity description is mapped into a 768-dimensional vector space. In this vector space, similar activities are placed closer to each other than dissimilar ones. This allows the calculation of the `similarity` of activities, the `classification` of new activities, and the `recommendation` of new activities to users who participated in similar activities in the past.
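
As a rough illustration of what "closer" means here, the sketch below computes the cosine similarity between made-up 3-dimensional vectors. The vectors and the helper `cosine_sim` are assumptions for illustration only; the notebook itself uses 768-dimensional embeddings and scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up 3-dimensional "embeddings" (the real ones have 768 dimensions)
wine_tasting = [0.9, 0.1, 0.0]
food_festival = [0.8, 0.3, 0.1]
hiking_trip = [0.0, 0.2, 0.9]

# vectors pointing in a similar direction score close to 1,
# vectors pointing in unrelated directions score close to 0
sim_food = cosine_sim(wine_tasting, food_festival)
sim_hike = cosine_sim(wine_tasting, hiking_trip)
print(sim_food, sim_hike)
```

Cosine similarity only compares the direction of two vectors, not their length, which is why it is a common choice for comparing embeddings.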

### Step 1: Prepare list of strings (incl. event name and description) as input for the sentence embedding model

In [4]:
# extract relevant variables from original data file
act_embeddings = activity[['activity_id', 'name', 'description', 'status']]

# delete 'deleted' events (otherwise they are re-added from activity_interest and show up as na's)
act_embeddings = act_embeddings[act_embeddings.status != 'DELETED'].drop(columns=['status'])

# extract relevant columns into list
act_descriptions = act_embeddings[['name', 'description']].astype(str).values.tolist()

print("\nSentence embeddings will be created for", len(act_descriptions), "activity descriptions.")


Sentence embeddings will be created for 43036 activity descriptions.


### Step 2: Load and apply the pre-trained model

In [None]:
# load model for sentence embeddings
model = SentenceTransformer('sentence-transformers/stsb-xlm-r-multilingual')

# create sentence embeddings of entries in list of strings
embeddings = model.encode(act_descriptions)
embeddings_ls = list(embeddings)

# merge sentence embeddings to df with descriptions
act_embeddings['embeddings'] = embeddings_ls

The previous steps (loading the sentence embedding model and applying it to your own text input) can be done either as shown above or via a service/API that is available online.

$~$
### Step 3: Use the sentence embeddings

In steps 1 and 2 we generated one sentence embedding per past activity description. This allows any new activity description to be matched with (1) the most similar past events and (2) potentially interested users (who went to those past events), who can then be notified about this new activity. As a further extension of this idea, the same technique can be used to match (3) like-minded users, as will be explained later.

To illustrate this workflow, four functions are defined first and then applied to an exemplary new activity description in [3. The use case](#intro). The first two functions are required for personalised activity recommendations, the latter two for personalised friend suggestions.

$~$
#### Define function (1) that outputs the activity_id of most similar events

First, we specify a function which:
- takes as `input the embedding of a new activity description` and a `dataframe with an "embeddings" column` holding the embeddings of all past activities.
- provides as `output a list of the activity_id's` of most similar activities.

In [6]:
# the default threshold is .8 but can be adjusted in the function call
def get_similar_activities(new_embedding, past_embeddings_df, threshold = .8):
    
    # stack all past embeddings into one matrix (one row per activity)
    past_mat = np.vstack(past_embeddings_df['embeddings'].tolist())
    
    # get similarity between new sentence embedding and all existing embeddings
    # (one float per activity; note that this adds a column to the input dataframe)
    past_embeddings_df['similarity'] = cosine_similarity(np.expand_dims(new_embedding, axis=0), past_mat)[0]
    
    # get all events that have a similarity above the threshold
    similar_act = past_embeddings_df[past_embeddings_df.similarity > threshold]
    
    # get activity_id of those similar events
    return similar_act['activity_id'].unique()
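
To make the thresholding step more tangible, here is a minimal sketch on made-up data. The toy dataframe, its 3-dimensional vectors, and the activity_ids are assumptions for illustration, and the cosine similarity is written out in plain NumPy rather than with scikit-learn.

```python
import numpy as np
import pandas as pd

# made-up 3-dimensional embeddings standing in for the 768-dimensional ones
toy_df = pd.DataFrame({
    'activity_id': [101, 102, 103],
    'embeddings': [np.array([1.0, 0.0, 0.0]),
                   np.array([0.9, 0.1, 0.0]),
                   np.array([0.0, 0.0, 1.0])],
})
new_embedding = np.array([1.0, 0.05, 0.0])

# stack the per-row vectors into one (n_activities, n_dims) matrix
past_mat = np.vstack(toy_df['embeddings'].to_list())

# cosine similarity of the new vector against every row at once
scores = past_mat @ new_embedding / (
    np.linalg.norm(past_mat, axis=1) * np.linalg.norm(new_embedding))

# keep the activity_ids whose similarity exceeds the threshold
threshold = 0.8
similar_ids = toy_df.loc[scores > threshold, 'activity_id'].tolist()
print(similar_ids)
```

The first two toy activities point in nearly the same direction as the new embedding and pass the threshold; the third is orthogonal to it and is dropped.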

#### Define function (2) that outputs the user_ids of people who went to these events `for activity recommendations`

As a second step, we can specify a function which:
- takes as `input a dataframe` holding past activities and their participants, i.e., the result of reading in the data file, and the `previously created list` of the activity_id's of similar events (= output of function 1), in that order.
- provides as `output a list` of the user_id's of people who went to those similar events before.

In [7]:
def get_users(participants_df, similar_events_ls):
    
    # get all people that went to the similar events
    participants_df = participants_df[participants_df['activity_id'].isin(similar_events_ls)]
    return participants_df['user_id'].unique()

Assuming that people who went to given activities in the past would enjoy going to a similar activity in the future, this information can be used to give users personalised recommendations for new activities. To further personalise these recommendations, the `location` of the people should also be taken into account. For this, an average range can be applied or, ideally, a new feature can be introduced at registration where each user specifies how far they are willing to travel for activities (or both can be combined).
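
A location filter of this kind could be sketched as follows. Everything here is hypothetical: the data files used in this notebook are not shown to contain coordinates, so the `lat`, `lon`, and `max_km` columns, the example users, and the helper `haversine_km` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# hypothetical user table: coordinates and a self-reported travel range
users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'lat': [52.0907, 52.3676, 50.8514],   # Utrecht, Amsterdam, Maastricht
    'lon': [5.1214, 4.9041, 5.6910],
    'max_km': [10, 50, 50],
})
event_lat, event_lon = 52.0907, 5.1214    # event location: Utrecht

# distance from every user to the event, then keep users within their own range
users['distance_km'] = haversine_km(users['lat'], users['lon'], event_lat, event_lon)
in_range = users.loc[users['distance_km'] <= users['max_km'], 'user_id'].tolist()
print(in_range)
```

Such a filter would be applied to the output of function 2, so that only users who are both interested and close enough receive the recommendation.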

To extend this idea, the `activity embeddings` can also be related to the users who actually participated in a given activity. Per user, the embeddings of all events in which they participated can be averaged to create a single `"person embedding"`. Based on all the activities someone joined in the past, this averaged embedding captures that person's interests. It can then be used to further personalise the user experience by identifying `people with similar interests` and recommending them as friends.
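
The averaging idea can be illustrated on made-up data. The sketch below is not the notebook's own function; the toy dataframe and its 3-dimensional vectors are assumptions, and the result is collected in a plain dictionary for readability.

```python
import numpy as np
import pandas as pd

# made-up participation records with 3-dimensional activity embeddings
toy_participants = pd.DataFrame({
    'user_id': [1, 1, 2],
    'activity_embedding': [np.array([1.0, 0.0, 0.0]),
                           np.array([0.0, 1.0, 0.0]),
                           np.array([0.0, 0.0, 1.0])],
})

# average all activity vectors of one user into a single "person embedding"
person_embeddings = {
    user_id: np.mean(np.vstack(group['activity_embedding'].to_list()), axis=0)
    for user_id, group in toy_participants.groupby('user_id')
}
print(person_embeddings)
```

User 1 joined two different kinds of activities, so their person embedding lies halfway between the two activity vectors; user 2 joined only one activity and simply inherits its vector.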

$~$
#### Define function (3) that outputs average embeddings of people
- takes as `input a dataframe listing all users`, the activity_ids and corresponding embeddings of all activities they joined
- provides as `output a dataframe` with one average embedding per person (column `avg_embed`, indexed by user_id)

In [8]:
def get_average_embeddings(participants_df):
    
    # split embedding vector into 768 separate columns
    participants_split = pd.DataFrame(participants_df.activity_embedding.tolist(), index= participants_df.user_id)
    
    # calculate average embedding per person
    average_embedding = participants_split.groupby(['user_id']).mean()
    
    #  merge 768 columns back into one embedding vector
    average_embedding['avg_embed'] = average_embedding.iloc[:, 0:768].values.tolist()
    return average_embedding[['avg_embed']]

#### Define function (4) that outputs all user_ids of people similar to user X `for friend suggestions`

- takes as `input the previously created dataframe` with average person embeddings (in a column called `avg_embed`, output of function 3), the `row position` in that same dataframe of the user for whom we want to find similar people (`user_x`, a position rather than a user_id), and the `threshold for similarity`. Both of the latter have default values, which can be adjusted in the function call if desired.
- provides as `output` a list of user_id's of people similar to user X.

In [9]:
def get_similar_people(average_embedding, user_x = 0, threshold = .9):
    
    # create matrix of all embeddings (one row per user)
    embed_mat = np.array(average_embedding['avg_embed'].tolist())

    # calculate cosine similarity between user x (row position user_x) and every user;
    # only this one row is needed, not the full pairwise matrix
    sim_scores = cosine_similarity(embed_mat[[user_x]], embed_mat)[0]
    
    # write similarity scores (similarity with this specific user x) to df
    average_embedding['sim_score'] = sim_scores
    
    # keep users whose similarity exceeds the threshold, excluding user x themselves
    is_similar = sim_scores > threshold
    is_similar[user_x] = False
    return average_embedding.index[is_similar].tolist()

<a id="intro"></a>
# 3. The use case

#### We will now apply all the functions to a new exemplary activity description to illustrate the workflow

As an example, a new activity description was drawn up:

> "Zin om samen te genieten van heerlijk eten en een goed glas wijn? In het laatste weekend van maart is er een evenement in Utrecht waar ze heerlijke hapjes en drankjes verzorgen. Je kan allerlei verschillende culturele gerechten proberen en de sfeer is altijd gezellig!! Kom je ook?"

This example description is used below to illustrate the workflow and its potential for each newly added activity.

In [None]:
# new event description
example_event = "Zin om samen te genieten van heerlijk eten en een goed glas wijn? In het laatste weekend van maart is er een evenement in Utrecht waar ze heerlijke hapjes en drankjes verzorgen. Je kan allerlei verschillende culturele gerechten proberen en de sfeer is altijd gezellig!! Kom je ook?"

# get embedding for the new description
example_embedding = model.encode(example_event)

## 3.1 Personalised activity recommendations
### Apply the first function

This function outputs a list of activity_id's from past activities, which are similar to this newly created activity.

In [11]:
# apply first function and save output in object
similar_activities = get_similar_activities(example_embedding, act_embeddings, threshold = .8)

print("\nThere are", len(similar_activities), "activities which were similar to this newly created activity.")


There are 43 activities which were similar to this newly created activity.


### Apply the second function

This function outputs a list of user_id's of people who went to those past activities. Assuming that people who went to given activities in the past would enjoy going to a similar activity in the future, this information can be used to recommend the newly created activity to exactly these users and/or alert them about it. As mentioned before, the location of the user and the activity should be taken into account too, using the average range, a new feature at registration where each user specifies how far they are willing to travel for activities, or both.

In [12]:
# read in two columns from existing data file: user_ids and which activities they each participated in
participants = activity_participant[['activity_id', 'user_id']].copy()

# apply second function by using the output from function 1 and user data as new input
similar_participants = get_users(participants, similar_activities)

print("\nThere are", len(similar_participants), "people who went to those events and might be interested in this one too.")


There are 403 people who went to those events and might be interested in this one too.


### Applying these two functions already concludes the process of personalising activity recommendations.
### In the next step, we will extend the idea of sentence embeddings to the users themselves.

$~$
## 3.2 Personalised friend suggestions

As a first step, we need to prepare a `participant dataframe`, which holds all `activity_ids` and corresponding `embeddings` `per user`. The starting point for this is the same dataframe as was used in the previous step, i.e., a dataframe holding past activities and their participants, which was the result of simply reading in two columns from the data.

In [13]:
# get user_ids and which activities they participated in
participants = activity_participant[['activity_id', 'user_id']].copy()

# keep only active users
participants = participants[participants['user_id'].isin(active_users)]

# get all events per (active) user
participants = participants.groupby(['user_id'])['activity_id'].value_counts()
participants = participants.to_frame('count').reset_index()

We now create a dictionary that links each `activity_id` to its corresponding `embedding`. This allows us to easily map the embeddings to the participants' past activities: the participants dataframe only lists the activity_ids, and using the dictionary we can add the corresponding embeddings.
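
The mapping step can be tried out on a small made-up example first. The ids, the 2-dimensional vectors, and the name `embedding_lookup` are assumptions for illustration; the point is that an activity_id without an entry in the dictionary ends up as a missing value and can be dropped afterwards.

```python
import numpy as np
import pandas as pd

# made-up lookup table: activity_id -> embedding (activity 999 has no entry)
embedding_lookup = {101: np.array([1.0, 0.0]), 102: np.array([0.0, 1.0])}

toy = pd.DataFrame({'user_id': [1, 1, 2], 'activity_id': [101, 999, 102]})

# ids that are missing from the dictionary are mapped to NaN
toy['activity_embedding'] = toy['activity_id'].map(embedding_lookup)
n_missing = int(toy['activity_embedding'].isna().sum())

# drop the rows for which no embedding was found
toy = toy.dropna(subset=['activity_embedding'])
print(n_missing, toy['activity_id'].tolist())
```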

In [None]:
# create dictionary of activity_ids and corresponding embeddings
embedding_dict = dict(zip(act_embeddings.activity_id, act_embeddings.embeddings))

# use dictionary to map embedding to activity_id
participants['activity_embedding'] = participants['activity_id'].map(embedding_dict)

# drop na's (events which were deleted -after- they happened)
participants = participants.dropna()

print("\nThis results in a dataframe listing -all- activities and -all- corresponding embeddings per user:\n")
participants.head()

### Apply the third function

This dataframe can now be used as input for function 3, which computes an average sentence embedding per person. Currently, each cell of the `activity_embedding` column holds a list of 768 numbers. To calculate the average embedding per person, we split these values into 768 separate columns, compute the column-wise average over all rows belonging to one user, and then merge the 768 columns back into one. The result is a single average embedding vector per person that reflects their interests based on the past activities they joined. Applying the function below takes care of that.

In [None]:
avg_embedding = get_average_embeddings(participants)

print("\nThis results in a dataframe listing the average embedding per user:\n")
avg_embedding.head()

### Apply the fourth function

In the last step, we use the person embeddings to find people who have similar interests based on their past activity.

The resulting number of user_id's suggested as friends for the user in question may seem rather high, but please consider:
- further `filter options` can/should be applied in a next step, such as matching location (taking into account the average travel range + self-indicated willingness to travel), matching age groups, matching interests, etc. This could significantly improve the recommendations and make them even more relevant.
- the `similarity threshold` is set to a default of .9, but it can be adjusted in either direction.
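
To get a feeling for how strongly the threshold drives the number of suggestions, one could count the matches at a few candidate values. The similarity scores below are made up for illustration and do not come from the data.

```python
import numpy as np

# made-up similarity scores between user x and six other users
sim_scores = np.array([0.97, 0.93, 0.91, 0.88, 0.72, 0.40])

# count how many suggestions survive each candidate threshold
counts = {t: int((sim_scores > t).sum()) for t in (0.8, 0.9, 0.95)}
print(counts)   # {0.8: 4, 0.9: 3, 0.95: 1}
```

A stricter threshold yields fewer but more closely matching suggestions; running the same count on the real scores may help to pick a sensible default.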

In [None]:
similar_users = get_similar_people(avg_embedding, user_x = 0, threshold = .9)

print("\nBased on previous activity, this person can be recommended a friendship with the following", len(similar_users), "people (aka user_id's):\n\n", similar_users)