# Customer2Vec

This notebook illustrates the modules developed for Customer2Vec.

### Import UGTD

In [264]:
import warnings
warnings.filterwarnings('ignore')

# select path 
path_to_ugtd = 'C:/Users/Fabia/DataspellProjects/Data Generating Process'

# import data
import pandas as pd
comments = pd.read_excel('comments_final_2005.xlsx').drop('Unnamed: 0', axis = 1)

The raw version of UGTD includes the complete set of data scraped by the two modules for Facebook and Instagram. The fitted Doc2Vec models are trained after discarding all comments that consist of fewer than 3 words.
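
The length filter mentioned above can be sketched as follows. This is a minimal illustration that assumes whitespace tokenisation; the helper name `drop_short_comments` and the sample comments are hypothetical and not part of the scraping modules.

```python
def drop_short_comments(texts, min_words=3):
    """Keep only string comments with at least `min_words` whitespace-separated tokens."""
    return [t for t in texts if isinstance(t, str) and len(t.split()) >= min_words]


# hypothetical sample comments, for illustration only
sample = ["love it", "when is this sale over", "", "great product overall"]
kept = drop_short_comments(sample)
```

With the default threshold, the two-word comment and the empty string are dropped, and the two longer comments are kept.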

UGTD per company

In [265]:
comments.groupby('company').count()['text'].sort_values(ascending = False)

company
meijer                   155959
Birkenstock               49092
wehkampnl                 37037
FitForFreeNL              23984
CUBE Bikes                22403
bequiet_official          21850
BrewDog                   20653
mymuesli                  20162
fractalofficial           14537
action.deutschland        13492
SNOCKS                    10580
everdrop                   7910
Rent the Runway            7619
N26                        6677
Hungryroot                 6321
weareallbirds              6159
Fanatec                    6030
bangolufsen                4796
OttosBurger                4724
Sorare                     4506
Face Reality Skincare      4164
hansimglück                4025
Simplon                    2773
BugabooDE                  2659
HESSNATUR                  1878
Getaround                  1831
Back Market                1633
OUTFITTERY                 1494
ledlenser                  1483
Coffee Fellows             1433
Gymshark                    415


### Model Selection

To find the parameter setting that yields the highest $\textit{probability-weighted amount of information}$ (PWI), several Doc2Vec models were trained. They differ in the number of $\textit{epochs}$ (e) and the $\textit{vector size}$ (v), while all models use a $\textit{minimum word count}$ (m) of 50.

The following pre-trained models are available:


- vector_space_total_deeplearn_raw_e50_m50_v200
- vector_space_total_deeplearn_raw_e50_m50_v300
- vector_space_total_deeplearn_raw_e50_m50_v400


- vector_space_total_deeplearn_raw_e200_m50_v200
- vector_space_total_deeplearn_raw_e200_m50_v300
- vector_space_total_deeplearn_raw_e200_m50_v400


- vector_space_total_deeplearn_raw_e400_m50_v200
- vector_space_total_deeplearn_raw_e400_m50_v300
- vector_space_total_deeplearn_raw_e400_m50_v400
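
The model names above encode the hyperparameters. The following sketch assumes the naming pattern `e<epochs>_m<minimum word count>_v<vector size>` read off the list; it rebuilds the nine names from the parameter grid and parses a name back into its parameters. `model_name` and `parse_model_name` are hypothetical helpers.

```python
import re
from itertools import product

PREFIX = "vector_space_total_deeplearn_raw_"


def model_name(epochs, min_count, vector_size):
    """Build a file name following the pattern used in the list above."""
    return f"{PREFIX}e{epochs}_m{min_count}_v{vector_size}"


def parse_model_name(name):
    """Recover the hyperparameters encoded in a model file name."""
    match = re.fullmatch(re.escape(PREFIX) + r"e(\d+)_m(\d+)_v(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected model name: {name!r}")
    epochs, min_count, vector_size = map(int, match.groups())
    return {"epochs": epochs, "min_count": min_count, "vector_size": vector_size}


# 3 epoch settings x 3 vector sizes, minimum word count fixed at 50
available = [model_name(e, 50, v) for e, v in product([50, 200, 400], [200, 300, 400])]
params = parse_model_name(available[1])
```

Parsing the name instead of hard-coding the parameters keeps the selected `model_syntax` and any later reporting of epochs and vector size in sync.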



In [266]:
# select model
model_syntax = 'e50_m50_v300'

# import model
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load(f'vector_space_total_deeplearn_raw_{model_syntax}')

Extract the document vectors $\vec{C}$, the word vectors $\vec{V}$ and the vocabulary $V$ from the pre-trained model

In [267]:
word_vectors = model.wv.get_normed_vectors()
word_indexes = model.wv.key_to_index

document_vectors = model.dv.get_normed_vectors()
vocab = list(model.wv.key_to_index.keys())

### Pre Processing

In [281]:
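# Pre-processing of word vectors: cluster them so that a noise cluster can be removed
from Top2VecModule import calcTopicVectors
# positional arguments as in the later call: phi = 15, gamma = 5, min_cluster_size = 40
word_topic_vecs, words = calcTopicVectors(word_vectors, 15, 5, 40, return_cluster=True)
# one row per vocabulary entry with its cluster label
words = pd.DataFrame(columns=['word', 'cluster'], data=zip(vocab, words))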

# remove = 5 # SELECT
remove = 0
remove_words = words[words['cluster'] == remove].word.values
remove_vectors = list(words[words['cluster'] == remove].index)
remove_words

array(['van', 'de', 'haha', 'der', 'hahaha', 'den', 'lisa', 'marie',
       'laura', 'sarah', 'kim', 'michael', 'jessica', 'michelle',
       'nicole', 'sandra', 'linda', 'anne', 'anna', 'daniel', 'jennifer',
       'ann', 'jan', 'thomas', 'julia', 'hahahaha', 'john', 'mike',
       'smith', 'mary', 'da', 'tim', 'kelly', 'ma', 'stephanie', 'en',
       'tom', 'chris', 'maria', 'melanie', 'martin', 'ashley', 'amy',
       'kevin', 'danielle', 'peter', 'hihi', 'robin', 'dennis', 'julie',
       'patrick', 'sabrina', 'nadine', 'denise', 'lena', 'sophie', 'lee',
       'nina', 'alex', 'taylor', 'samantha', 'andrea', 'joe', 'lynn',
       'jenny', 'vanessa', 'amanda', 'joyce', 'sanne', 'jong', 'daisy',
       'frank', 'christina', 'james', 'robert', 'bianca', 'sharon',
       'tina', 'cindy', 'stefan', 'hahah', 'angela', 'nick', 'dijk',
       'mandy', 'heather', 'berg', 'esther', 'jo', 'johnson', 'iris',
       'vries', 'el', 'rachel', 'scott', 'claudia', 'patricia', 'emily',
       'rick'

In [282]:
import numpy as np
word_vectors_red = np.delete(word_vectors,remove_vectors, axis = 0)
vocab_red = np.delete(np.array(vocab),remove_vectors, axis = 0)
vocab_red = vocab_red.tolist()

### Company Selection
To apply Customer2Vec, a company has to be chosen. The table below shows the first rows of the list of all companies used for pre-training the models, together with their industries.

In [283]:
# import company information

company_information = pd.read_excel('company_information.xlsx')
company_information.head()

Unnamed: 0,company,Main Industry,Sub Industry,Durable
0,action.deutschland,Tech,supermarket,non durable
1,Back Market,Tech,electronic,durable
2,bangolufsen,Tech,electronic,durable
3,bequiet_official,Tech,electronic,durable
4,Birkenstock,Fashion,fashion,non durable


In [284]:
# choose company
company = 'everdrop'

### Application of Customer2Vec onto chosen company

The modules created for Customer2Vec are written in the Top2VecModule file and are imported into this notebook.

#### Plain topic modeling via Top2Vec

For topic modeling, the document reduction procedure has to be performed so that only the document vectors $\vec{d}$ belonging to the chosen company are used.

In [285]:
company_documents = comments[comments['company'] == company]
company_vectors = document_vectors[company_documents.index.values]
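
The row selection above relies on the index of `comments` matching the row order of `document_vectors`. A dependency-free sketch of the same document reduction is given below. It assumes one company label per document vector, in the same order; the helper `select_company_vectors` and the toy data are hypothetical.

```python
def select_company_vectors(companies, document_vectors, company):
    """Return the positions and vectors of all documents belonging to `company`."""
    if len(companies) != len(document_vectors):
        raise ValueError("labels and vectors must be aligned row by row")
    positions = [i for i, name in enumerate(companies) if name == company]
    return positions, [document_vectors[i] for i in positions]


# toy data: four documents with two-dimensional vectors
labels = ["meijer", "everdrop", "meijer", "everdrop"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
positions, company_vecs = select_company_vectors(labels, vectors, "everdrop")
```

The explicit length check makes a misalignment between labels and vectors fail loudly instead of silently selecting the wrong rows.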

In addition to the company-specific documents $C_X$ and document vectors $\vec{C_X}$, the company-specific vocabulary $V_X$ and word vectors $\vec{V_X}$ have to be created in order to model the latent topics accurately.

In [286]:
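import numpy as np

# flag every word of the reduced vocabulary that occurs in at least one company document
check_voc = pd.DataFrame(vocab_red)
check_voc['isin_doc'] = 0

for idx, word in enumerate(vocab_red):
    for doc in company_documents.text:
        # note: `word in doc` is a substring test, not a token match
        if word in doc:
            check_voc.at[idx, 'isin_doc'] = 1
            break  # one hit is enough, skip the remaining documents

# company vocabulary and the matching word vectors
company_vocab = [vocab_red[i] for i in check_voc[check_voc['isin_doc'] == 1].index.values]

remove_vectors = list(check_voc[check_voc['isin_doc'] == 0].index.values)
company_word_vectors = np.delete(word_vectors_red, remove_vectors, axis=0)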
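
The loop above uses a substring test, so a short vocabulary entry such as `van` also matches inside a longer word. A possible token-based variant is sketched below. It is not what the cell above does, and it assumes that lowercased `\w+` tokens are an adequate approximation of the tokenisation used for training. `company_vocabulary` and the sample documents are hypothetical.

```python
import re


def company_vocabulary(vocab, documents):
    """Return positions and words of `vocab` that occur as whole tokens in `documents`."""
    tokens = set()
    for doc in documents:
        tokens.update(re.findall(r"\w+", str(doc).lower()))
    keep = [i for i, word in enumerate(vocab) if word in tokens]
    return keep, [vocab[i] for i in keep]


# hypothetical sample documents
docs = ["The tabs dissolve well", "Advantage of soap"]
keep, words_found = company_vocabulary(["tab", "tabs", "soap", "van"], docs)
```

Building the token set once makes each membership check a constant-time lookup, and `van` is no longer matched through `Advantage`.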

Before the topic modeling algorithm Top2Vec can be applied to the company-specific data, the parameters $\phi$, $\gamma$ and $\textit{minimum cluster size}$ have to be selected. The best parameters were derived from a simulation across all companies by calculating the median across companies for each parameter combination. However, it is still possible to select $\phi$, $\gamma$ and $\textit{minimum cluster size}$ individually.

In [287]:
phi = 5
gamma = 5
min_cluster_size = 15
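
The median-based selection described above can be sketched as follows. The sketch assumes that each parameter combination has one quality score per company and that higher scores are better. The function name `best_combination` and all numbers are made up for illustration and are not results of the simulation.

```python
from statistics import median


def best_combination(scores):
    """Pick the parameter combination with the highest median score across companies.

    `scores` maps (phi, gamma, min_cluster_size) to a dict {company: score}.
    """
    medians = {combo: median(per_company.values()) for combo, per_company in scores.items()}
    best = max(medians, key=medians.get)
    return best, medians


# made-up scores for two combinations and three placeholder companies
toy_scores = {
    (5, 5, 15): {"a": 3.0, "b": 5.0, "c": 4.0},
    (15, 5, 40): {"a": 9.0, "b": 1.0, "c": 2.0},
}
best, medians = best_combination(toy_scores)
```

Using the median rather than the mean keeps a single company with an unusually high score from dominating the choice, as the second combination in the toy data shows.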

Create $\vec{t}$ based on selected parameters

In [288]:
from Top2VecModule import calcTopicVectors
topic_vectors = calcTopicVectors(company_vectors, phi,gamma,min_cluster_size)

Derive topic words based on $S_{C (\vec{t}, \vec{w_X})}$ with $w_X$ $\in$ $V_X$

In [289]:
from Top2VecModule import _find_topic_words_and_scores
topic_words, topic_scores = _find_topic_words_and_scores(topic_vectors, company_word_vectors,company_vocab)
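
How topic words follow from $S_C$ can be illustrated with a small, dependency-free sketch: every word vector is scored by its cosine similarity to a topic vector and the highest-scoring words are kept. `cosine` and `top_words` are hypothetical helpers, not the implementation of `_find_topic_words_and_scores`, and the two-dimensional vectors are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one of them is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_words(topic_vector, word_vectors, vocab, n=2):
    """Return the `n` (word, score) pairs most similar to `topic_vector`."""
    scored = sorted(
        ((cosine(topic_vector, wv), word) for wv, word in zip(word_vectors, vocab)),
        reverse=True,
    )
    return [(word, score) for score, word in scored[:n]]


# toy example: one topic vector and three word vectors
ranked = top_words(
    [1.0, 0.0],
    [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]],
    ["detergent", "bottle", "summer"],
)
```

Because the notebook works with normed vectors, the cosine similarity there reduces to a plain dot product; the explicit normalisation is kept here so that the sketch also works for unnormalised input.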

In [290]:
topic_words

array([['wonderbar', 'environmentally', 'sustainable', ..., 'wheel',
        'hair', 'dishwasher'],
       ['wonderbar', 'ride', 'sustainable', ..., 'bow', 'cleaning',
        'wool'],
       ['pamela', 'ripe', 'everdrop', ..., 'summer', 'hug', 'pets'],
       ...,
       ['unpacked', 'regionally', 'plastic', ..., 'children', 'cotton',
        'washing'],
       ['reusable', 'plastic', 'bags', ..., 'hand', 'sustainability',
        'diapers'],
       ['vegetables', 'plastic', 'unpacked', ..., 'yogurt', 'recycling',
        'seeds']], dtype='<U15')

Because the $\textit{minimum cluster size}$ is small, the number of topics is large. It is therefore decreased by iteratively merging each topic into its closest neighbor.

In [291]:
len(topic_vectors)

155
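
One way to read the merging step is sketched below. It assumes that the smallest topic is merged first, that closeness is measured by cosine similarity, and that the merged topic vector is the size-weighted mean of the two vectors; `hierarchical_topic_reduction` may differ in all three points. `reduce_topics` and the toy data are hypothetical.

```python
import math


def _cos(u, v):
    """Cosine similarity of two vectors (0.0 if one of them is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_topics(topic_vectors, topic_sizes, k):
    """Merge the smallest topic into its most similar topic until `k` topics remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    vectors = [list(v) for v in topic_vectors]
    sizes = list(topic_sizes)
    while len(vectors) > k:
        s = min(range(len(sizes)), key=sizes.__getitem__)  # smallest topic
        others = [i for i in range(len(vectors)) if i != s]
        t = max(others, key=lambda i: _cos(vectors[s], vectors[i]))  # closest neighbor
        total = sizes[s] + sizes[t]
        # size-weighted mean of the two topic vectors (assumption of this sketch)
        vectors[t] = [(sizes[s] * a + sizes[t] * b) / total for a, b in zip(vectors[s], vectors[t])]
        sizes[t] = total
        del vectors[s]
        del sizes[s]
    return vectors, sizes


# toy example: three topics reduced to two
merged_vectors, merged_sizes = reduce_topics(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], [10, 2, 8], k=2
)
```

In the toy example the topic of size 2 is absorbed by its near-duplicate of size 10, while the unrelated third topic stays untouched.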

In [292]:
# reduce topics to k, with k = 20

from Top2VecModule import hierarchical_topic_reduction
from Top2VecModule import _calculate_documents_topic

doc_top, doc_dist = _calculate_documents_topic(topic_vectors, company_vectors, dist=True, num_topics=None)

topic_sizes = pd.Series(doc_top).value_counts()
topic_vectors, doc_top, topic_words, topic_word_scores = hierarchical_topic_reduction(20, topic_vectors, topic_sizes,company_vectors, company_word_vectors, company_vocab)
topic_words

array([['detergent', 'everdrop', 'dishwasher', 'tabs', 'detergents',
        'laundry', 'powder', 'agents', 'fragrance', 'washing', 'rinsing',
        'dissolve', 'cleaner', 'cleaners', 'environmentally', 'cleaning',
        'wonderbar', 'rinse', 'wash', 'enthusiastic', 'packaging',
        'starter', 'packs', 'gel', 'palm', 'flushing', 'unpacked',
        'smells', 'thrilled', 'bottles', 'sustainable', 'shampoo',
        'bottle', 'household', 'soap', 'often', 'fabric', 'towels',
        'agent', 'dryer', 'effect', 'eco', 'washed', 'toilet', 'vinegar',
        'shower', 'containers', 'suitable', 'convinced', 'storage'],
       ['tabs', 'cleaner', 'powder', 'bottle', 'dissolve', 'everdrop',
        'dishwasher', 'toilet', 'bathroom', 'detergent', 'cleaning',
        'cleaners', 'glass', 'bottles', 'rinsing', 'agents', 'rinse',
        'tab', 'kitchen', 'soap', 'plastic', 'purpose', 'detergents',
        'spray', 'crumbled', 'packaging', 'gel', 'starter', 'washing',
        'fragrance',

The $PWI(t)$, the probability-weighted amount of information of topic $t$, can be used to detect well-differentiated topics.

In [293]:
from Top2VecModule import PWI_unigram
PWI, pwi_per_topic = PWI_unigram(company_documents.text,doc_top,topic_words, num_words=20)
pwi_per_topic.sort_values(by = 'PWI', ascending = False)

Unnamed: 0,topic,PWI
7,7,894.863479
10,10,639.727007
6,6,358.156023
9,9,333.737839
13,13,323.907951
12,12,268.876603
3,3,250.359679
2,2,241.919801
11,11,228.316268
19,19,219.511408


#### Topic evolution and sentiment shifts

In [295]:
# CAUTION: time consuming

from Top2VecModule import getTopicTimeDistribution
year_topic_counts,topic_time_dist = getTopicTimeDistribution(topic_vectors, company_vectors, company_documents, company_word_vectors, company_vocab)

2023-06-07 15:18:55,886 loading file C:\Users\Fabia\.flair\models\sentiment-en-mix-distillbert_4.pt


#### Create brand vocabulary

Since the $\vec{w_X}$ are very close to $\vec{brand}$, a reduction of the $\vec{w}$ is not necessary.

In [297]:
from Top2VecModule import getProductProxy
brand = company # computer # fruit # bank 

brand_vector = word_vectors[vocab.index(brand)].reshape(1,-1)
company_vocab = getProductProxy(brand_vector,word_vectors, word_indexes)
company_vocab = company_vocab[company_vocab['similarity'] > 0.2].reset_index(drop = True)
company_vocab.head(50)



Unnamed: 0,word,similarity
0,everdrop,1.0
1,detergent,0.601391
2,tabs,0.544683
3,packaging,0.527251
4,cleaning,0.505267
5,agents,0.495643
6,plastic,0.489591
7,dishwasher,0.480069
8,detergents,0.471779
9,washing,0.467987


#### Product Mining

For the product mining algorithm, a product list first has to be selected from the brand vocabulary.
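
Once a product list is chosen, one plausible building block of product mining is counting how many comments mention each product per year. The sketch below covers only this counting step and assumes word-level matching on lowercased text; what `productMining` computes internally is not shown in this notebook. `product_mentions_per_year` and the sample comments are hypothetical.

```python
import re
from collections import Counter


def product_mentions_per_year(products, documents):
    """Count comments per (product, year); `documents` yields (year, text) pairs."""
    counts = Counter()
    for year, text in documents:
        if year == 0:  # rows with year 0 are dropped, as in the notebook cell
            continue
        tokens = set(re.findall(r"\w+", str(text).lower()))
        for product in products:
            if product in tokens:
                counts[(product, year)] += 1
    return counts


# hypothetical sample comments
docs = [
    (2021, "The tabs dissolve quickly"),
    (2021, "Tabs and soap arrived today"),
    (2022, "Love the soap"),
    (0, "soap"),
]
mentions = product_mentions_per_year(["tabs", "soap"], docs)
```

Each comment is counted at most once per product, so repeated mentions within one comment do not inflate the yearly totals.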

In [296]:
from Top2VecModule import productMining

# product_list = ['porridge','granola','calendar', 'paleo', 'mug', 'oats', 'mixer'] # mymuesli
product_list = ['detergent', 'tabs', 'powder', 'cleaners', 'soap', 'shampoo', 'bottles', 'pads'] # everdrop

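# work on a copy so that the changes below do not act on a slice of `comments`
company_documents = company_documents.copy()
company_documents['text'] = company_documents['text'].astype(str)
# drop rows whose year is 0
company_documents = company_documents[company_documents['year'] != 0]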
results,fig = productMining(product_list, company_documents)



#### Brand Context Recognition

In [298]:
from Top2VecModule import brandContextRecognition
word_count_filt, fig = brandContextRecognition(company_vocab, company_documents, company,word_vectors,document_vectors,vocab, 100,0.4,15,product_list = product_list)
fig.update_layout(template='simple_white', xaxis_title='# of documents containing brand context word', yaxis_title='Similarity towards brand word vector', title=f'Brand context of {brand}')



#### Benchmark Analysis

In order to benchmark the chosen company, a competitor (or any other company) has to be chosen.

In [263]:
# example for keyword search

from Top2VecModule import search_documents_by_keywords
keywords = ['price']
documents, doc_scores, doc_ids = search_documents_by_keywords(model, keywords, word_vectors, word_indexes, comments.text)
documents = pd.DataFrame(columns = ['text','scores', 'company'], data = zip(documents, doc_scores, comments.loc[doc_ids].company.values))
documents['text'] = documents['text'].str.lower()
documents[~documents['text'].str.contains('price', na = False)]

Unnamed: 0,text,scores,company
222,wow. and that's a sale.,0.310700,meijer
322,when is this sale over,0.290884,meijer
383,the product is good. but could be a little che...,0.277258,mymuesli
411,1 product but: d,0.273263,wehkampnl
413,i love this sale!,0.272362,meijer
...,...,...,...
469071,goodnight z! love it!,-0.340726,wehkampnl
469072,love my snockx,-0.343631,SNOCKS
469073,i love my yeezys: d,-0.349504,SNOCKS
469074,i love meijer!!,-0.358443,meijer


In [302]:
from Top2VecModule import radarAnalysis
target = 'Fanatec' # Birkenstock
competitor = 'bequiet_official' # weareallbirds

keyword_sets = ['quality', 'price', 'delivery', 'gaming', 'setup']
# keyword_sets = ['delivery', 'price', 'comfort', 'size', 'color']

results,radar_graph = radarAnalysis(model,target, competitor, comments, keyword_sets, 0.2, word_vectors, word_indexes,normalize = False)



In [303]:
results

[[0.027739999999999997, 0.7108282608695653],
 [0.24032222222222227, 0.13984285714285713],
 [-0.1235754716981132, -0.296],
 [0.48751333333333324, 0.5126790697674419],
 [0.38716956521739143, 0.4869869565217391]]
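
The nested list is easier to read once each row is labelled. The sketch below assumes that the rows follow the order of `keyword_sets` and that the two columns correspond to `target` and `competitor`; neither is confirmed by the output itself. `label_results` is a hypothetical helper.

```python
def label_results(results, keywords, target, competitor):
    """Attach keyword and company labels to the rows of `results`."""
    if len(results) != len(keywords):
        raise ValueError("expected one result row per keyword")
    return {
        keyword: {target: row[0], competitor: row[1]}
        for keyword, row in zip(keywords, results)
    }


# values copied from the output above
keyword_sets = ['quality', 'price', 'delivery', 'gaming', 'setup']
results = [
    [0.027739999999999997, 0.7108282608695653],
    [0.24032222222222227, 0.13984285714285713],
    [-0.1235754716981132, -0.296],
    [0.48751333333333324, 0.5126790697674419],
    [0.38716956521739143, 0.4869869565217391],
]
labelled = label_results(results, keyword_sets, 'Fanatec', 'bequiet_official')
```

A lookup such as `labelled['price']['Fanatec']` then replaces positional indexing into the nested list.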