### semantic aggregation

The following code shows a common issue with current encoders. A piece of text is composed of separable semantic components (commonly known as tags/labels).
Ex. "A fantasy-shooter game that allows for co-op multiplayer, and woman lead" -> fantasy, shooter, co-op multiplayer, woman lead

However, current encoders, while preserving the overall semantic structure of the corpus, are unable to separate the individual semantic components.
Put simply, when performing a semantic search, samples are retrieved **considering all semantic components at once**: this can be defined as **search noise**.
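
To make the idea of search noise concrete, here is a minimal toy sketch. It is not how the encoder actually represents text: it assumes, purely for illustration, that each semantic component has its own orthogonal direction and that the full sentence embeds as an equal blend of them. The names `directions`, `blended_query` and `cosine_sim` are hypothetical helpers.

```python
import math

def cosine_sim(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

components = ["fantasy", "shooter", "co-op multiplayer", "woman lead"]

# Toy assumption: one orthogonal (one-hot) direction per semantic component.
directions = {
    name: [1.0 if i == j else 0.0 for j in range(len(components))]
    for i, name in enumerate(components)
}

# Hypothetical stand-in for the full-sentence embedding: an equal blend of all components.
blended_query = [sum(column) for column in zip(*directions.values())]

# Similarity of the blended query to each individual component.
blended_scores = {name: cosine_sim(blended_query, vec) for name, vec in directions.items()}

# Similarity of a single-component query to its own component.
single_score = cosine_sim(directions["fantasy"], directions["fantasy"])

print(blended_scores)  # every component scores 0.5
print(single_score)    # 1.0
```

Under these toy assumptions, a query that blends four components matches each of them only half as strongly as a query containing that component alone, so no individual component stands out in the ranking.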

In [3]:
import pandas as pd
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors
tqdm.pandas()

model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')  # alternative model: all-mpnet-base-v2

# encode tags
df_tags = pd.read_parquet('M.parquet')
df_tags['vector_tags'] = df_tags['tags'].progress_apply(lambda x : model.encode(x))

# fit the nearest-neighbour index once, then search for the closest tags
nbrs_tags = NearestNeighbors(n_neighbors=10, metric='cosine').fit(df_tags['vector_tags'].tolist())

def find_similar(text):
    # return the 10 tags closest to `text` by cosine distance, nearest first
    distances, indices = nbrs_tags.kneighbors([model.encode(text)])
    return df_tags['tags'].iloc[indices[0]].tolist()

100%|██████████| 446/446 [00:07<00:00, 58.97it/s]


A clear example is the search for "A fantasy-shooter game that allows for co-op multiplayer, and woman lead": it returns only **a list of tags for a single semantic component**.
None of Fantasy, Shooter, or Female Protagonist appears in the results.

In [4]:
find_similar('A fantasy-shooter game that allows for co-op multiplayer, and woman lead')

['MMORPG',
 'Action RPG',
 'RPG',
 'Multiplayer',
 'Party-Based RPG',
 'Strategy RPG',
 'Tactical RPG',
 'Co-op Campaign',
 'Massively Multiplayer',
 'Character Action Game']

If we manually remove the component that was found from the query, the search surfaces the next-closest component by semantic similarity, and so on:

In [5]:
find_similar('A fantasy-shooter game, and woman lead')
# -> Shooter

['Third-Person Shooter',
 'Shooter',
 'Strategy RPG',
 'MMORPG',
 'RPG',
 'Fantasy',
 'Action RPG',
 'Arena Shooter',
 'Tactical RPG',
 'RPGMaker']

In [6]:
find_similar('A fantasy game, and woman lead')
# -> Fantasy

['Fantasy',
 'Strategy RPG',
 'Dark Fantasy',
 'RPG',
 'Female Protagonist',
 'Action RPG',
 'Character Action Game',
 'MMORPG',
 'RPGMaker',
 'Party-Based RPG']

In [7]:
find_similar('woman lead')
# -> Female Protagonist

['Female Protagonist',
 'Third Person',
 'Runner',
 'Third-Person Shooter',
 'Turn-Based Tactics',
 'Turn-Based',
 'Boss Rush',
 'First-Person',
 'RPGMaker',
 'Action']
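
The manual procedure above (search, drop the matched component from the query, search again) could in principle be approximated in vector space. The following is a minimal sketch of that idea and not the method used in this notebook: it swaps the manual text edits for a greedy projection removal, where the closest tag is picked and its direction is subtracted from the query vector before the next search. The function `peel_components`, the one-hot tag vectors, and the query weights are all hypothetical toy values; whether this behaves well on real encoder output is not tested here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def peel_components(query_vec, tag_vecs, max_steps=4, min_norm=1e-6):
    """Greedy sketch: pick the closest tag, remove its direction, repeat.

    Assumes every tag vector is non-zero.
    """
    residual = list(query_vec)
    found = []
    for _ in range(max_steps):
        # stop once nothing meaningful is left in the query vector
        if norm(residual) < min_norm:
            break
        candidates = {t: v for t, v in tag_vecs.items() if t not in found}
        if not candidates:
            break
        # closest remaining tag by cosine similarity to the residual
        best = max(
            candidates,
            key=lambda t: dot(residual, candidates[t]) / (norm(residual) * norm(candidates[t])),
        )
        found.append(best)
        # subtract the projection of the residual onto the chosen tag direction
        v = tag_vecs[best]
        scale = dot(residual, v) / dot(v, v)
        residual = [r - scale * x for r, x in zip(residual, v)]
    return found

# Toy data (hypothetical): one orthogonal direction per tag.
tag_vecs = {
    "Co-op Campaign":     [1.0, 0.0, 0.0, 0.0],
    "Shooter":            [0.0, 1.0, 0.0, 0.0],
    "Fantasy":            [0.0, 0.0, 1.0, 0.0],
    "Female Protagonist": [0.0, 0.0, 0.0, 1.0],
}
# Toy query: a blend in which each component has a different weight.
query = [0.4, 0.3, 0.2, 0.1]

found_order = peel_components(query, tag_vecs)
print(found_order)
```

With these toy vectors the tags come out in order of their weight in the blend, one per step, which mirrors the component-by-component behaviour seen in the manual searches.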