In [79]:
%%capture
!pip install -U sentence-transformers
import json
import pickle
import time
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

## **Main Idea**

The objective is to find semantic matches of the encoded descriptions using a language model based on given test audience segments. Subsequently, the cosine similarity between these encoded descriptions and either instances of "label_name" or the merge of "label_name" + "segment_description" is calculated. The "label_name" column contains keywords that can be utilized to find semantic matches. Experiments showed that retaining only "label_name" yields more meaningful semantic information compared to merging columns.
As a preprocessing step aimed at saving time during inference, the embeddings of df["label_name"] are computed and stored using the [*'paraphrase-multilingual-MiniLM-L12-v2'*](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) model, which supports text in both English and German. This is a multilingual small Language Model that maps texts to a 384 dimensional dense vector space, can be used for tasks such as semantic search, and is suitable for cos-similarity.


Here, I experimented with various approaches to find an optimal one in terms of both runtime and performance for the API. Performance was evaluated by calculating the mean of cosine similarities between the test segments and the found matches.

In [95]:
# Load dataset
df = pd.read_csv("/content/source_segments.csv", encoding="latin", sep=';|,', engine="python")
print(df.info())
print(df.isna().sum())
# Load model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Create embeddings (fill NaN values with "")
embeddings = model.encode(list(df["label_name"].fillna("")))

# Save the embeddings
pickle.dump(embeddings, open("embeddings.p", "wb"))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1630 entries, 0 to 1629
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   label_id_long        1630 non-null   int64 
 1   label_id             1630 non-null   int64 
 2   parent_id            1630 non-null   int64 
 3   segment_description  1626 non-null   object
 4   label_name           1411 non-null   object
dtypes: int64(3), object(2)
memory usage: 63.8+ KB
None
label_id_long            0
label_id                 0
parent_id                0
segment_description      4
label_name             219
dtype: int64


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/4.12k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/645 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/471M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [96]:
# Load model
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# model = SentenceTransformer('all-MiniLM-L6-v2')
# Read the embeddings
embeddings = pd.read_pickle("embeddings.p")
# Load dataset
df = pd.read_csv("/content/source_segments.csv", encoding="latin", sep=';|,', engine="python")
# Read the test audiences
with open("test_audiences.json","r") as file:
    test_audiences = json.load(file)
file.close()

### **Approach 1**

One approach to handling audience segments involves a brute force method: encoding each segment and comparing it to all embeddings from the source_list, then saving the best match. While this method may yield the best results, it's not time efficient.

*   Mean_Cos of all the test_audiences: 0.6026255935430527
*   Duration: 4.23s



In [100]:
start = time.time()
Mean_Cos = 0
for segment in test_audiences['test_audiences']:
  print(segment['description'])
  # Encode the test description
  y = model.encode(segment['description'])
  max, which = 0, None
  for i in range(len(df)):
      try:
        # Read the embeddings of the source_list and compute the cos_sim
        x = embeddings[i]
        sim = cosine_similarity([x], [y])[0][0]
        if sim>max:
          # Update the best result
          max = sim
          which = i
      except:
        continue
  Mean_Cos += max
  print('sim:', max)
  print(df.iloc[which])
  print()
# Mean cos of the best matches
print(f"Mean_Cos: {Mean_Cos/len(test_audiences['test_audiences'])}")
print(f"Duration: {round(time.time()-start, 2)}s")

PartnerSolutions > 208838 > Interest > Home & Garden > Home Appliances > Kitchenware
sim: 0.72592914
label_id_long                                                21408010000
label_id                                                         2140801
parent_id                                                          21408
segment_description                  interets related to home and garden
label_name             Interest | Home & Garden | Kitchen and Dining ...
Name: 681, dtype: object

PartnerSolutions > 208758 > Interest > Sports > Football > Bayern Munich
sim: 0.5689967
label_id_long                                        22340000000
label_id                                                   22340
parent_id                                                    223
segment_description    interest in consuming sports or equipment
label_name                            Interest | Sports | Soccer
Name: 862, dtype: object

Technology & Computing - MediaGroup DACH - Industrie 4.0 / Industry 4

### **Approach 2**

Upon examining the dataset and the label_ids, it becomes apparent that the dataset forms a tree structure, starting with general topics as the main roots and gradually narrowing down with each child and leaf. Hence, my second idea was to initially compare the encoded test segment with the label_names of the roots, select the best match, and then search its subtree for further matches. While this approach did result in an improvement in runtime, the outcomes were not promising.



*   Mean_Cos of all the test_audiences: 0.4710919503122568
*   Duration: 1.44s


In [102]:
start = time.time()
Mean_Cos = 0
for segment in test_audiences['test_audiences']:
  print(segment['description'])
  # Encode the test description
  y = model.encode(segment['description'])
  label_id, label_name = None, None
  max = 0
  df_first_root = df.loc[df['label_id_long']%10**10==0]
  for row in range(len(df_first_root)):
      try:
        # Read the embeddings of the roots and compute the cos_sim
        x = embeddings[row]
        sim = cosine_similarity([x], [y])[0][0]
        if sim>max:
          # Update the best result
          max = sim
          label_id = df_first_root.iloc[row]['label_id']
          label_name = df_first_root.iloc[row]['label_name']
      except:
        continue
  print('root:', label_name)
  df_temp = df.loc[df['label_name'].astype(str).str.contains(label_name)]
  max, which = 0, None
  for i in range(len(df_temp)):
      try:
        # Read the embeddings of the filtered source_list and compute the cos_sim
        x = embeddings[i]
        sim = cosine_similarity([x], [y])[0][0]
        if sim>max:
          # Update the best result
          max = sim
          which = i
      except:
        continue
  Mean_Cos += max
  print('sim:', max)
  print(df_temp.iloc[which])
  print()
# Mean cos of the best matches
print(f"Mean_Cos: {Mean_Cos/len(test_audiences['test_audiences'])}")
print(f"Duration: {round(time.time()-start, 2)}s")

PartnerSolutions > 208838 > Interest > Home & Garden > Home Appliances > Kitchenware
root: Psychographic
sim: 0.09360744
label_id_long                                         30101000000
label_id                                                    30101
parent_id                                                     301
segment_description    attitute towards consumption and nutrition
label_name                Psychographic | Consumption | Nutrition
Name: 1026, dtype: object

PartnerSolutions > 208758 > Interest > Sports > Football > Bayern Munich
root: Purchases & Consumption
sim: 0.4000638
label_id_long                                                50307010000
label_id                                                         5030701
parent_id                                                          50307
segment_description               Frequency of consuming food and drinks
label_name             Purchases & Consumption | Food & Drink | Gener...
Name: 1537, dtype: object

Technology &

### **Approach 3**

Therefore, rather than solely comparing the encoded test segment with the roots, my second idea involved comparing the encoded test segment with the averages of the embeddings of the children of each root. This is because the average embedding captures more semantic meaning of the subtree. Both the time efficiency and performance of this approach were promising, which ultimately led me to choose this method.



*   Mean_Cos of all the test_audiences: 0.5827115178108215
*   Duration: 1.95s


In [99]:
start = time.time()
Mean_Cos = 0
for segment in test_audiences['test_audiences']:
  print(segment['description'])
  # Encode the test description
  y = model.encode(segment['description'])
  label_id, label_name = None, None
  max = 0
  df_first_root = df.loc[df['label_id_long']%10**10==0]
  for row in range(len(df_first_root)):
      try:
        root = df_first_root.iloc[row]['label_name']
        x = np.mean(embeddings[list(df.loc[df['label_name'].astype(str).str.contains(root)].index)], axis=0)
        sim = cosine_similarity([x], [y])[0][0]
        if sim>max:
          max = sim
          label_id = df_first_root.iloc[row]['label_id']
          label_name = root
      except:
        continue
  print('root:', label_name)
  df_temp = df.loc[df['label_id'].astype(str).str[0]==str(label_id)]
  max, which = 0, None
  for i in (list(df_temp.index)):
      try:
        x = embeddings[i]
        sim = cosine_similarity([x], [y])[0][0]
        if sim>max:
          max = sim
          which = i
      except:
        continue
  Mean_Cos += max
  print('sim:', max)
  print(df_temp.loc[which])
  print()
# Mean cos of the best matches
print(f"Mean_Cos: {Mean_Cos/len(test_audiences['test_audiences'])}")
print(f"Duration: {round(time.time()-start, 2)}s")

PartnerSolutions > 208838 > Interest > Home & Garden > Home Appliances > Kitchenware
root: Purchase Intent
sim: 0.5681931
label_id_long                                                40602060800
label_id                                                       406020608
parent_id                                                        4060206
segment_description                 purchase intent for non-edible goods
label_name             Purchase Intent | Consumer Packaged Goods | No...
Name: 1336, dtype: object

PartnerSolutions > 208758 > Interest > Sports > Football > Bayern Munich
root: Interest
sim: 0.5689967
label_id_long                                        22340000000
label_id                                                   22340
parent_id                                                    223
segment_description    interest in consuming sports or equipment
label_name                            Interest | Sports | Soccer
Name: 862, dtype: object

Technology & Computing - MediaGr