# Topic Modeling on social posts concerning the brand 'Nutella'

### Camille Strasser

The goal of this analysis is to determine the most frequent topics in French-language social media posts about the brand 'Nutella'. This analysis shows which concepts French-speaking consumers associate with Nutella.

The mathematical approach to topic modeling is described in the first section of the analysis. For the business conclusions of the study, you can jump to the second section.

# 1 - Technical approach to topic modeling

We want to discover which topics and keywords are the most frequent in a raw text corpus. Latent Dirichlet Allocation (LDA), the most widely used algorithm for topic modeling, has been chosen for this quick analysis. We will go through the algorithm step by step:
    
    - Clean the text corpus
    - Find a numerical representation
    - Perform LDA on the numerical representation

## 1.1 - Cleaning the raw text corpus

For this step of the algorithm, the functions in the following package will be used:

In [1]:
from utils.data_cleaner import DataCleaner

Let's take a look at the first line of the raw text corpus

In [2]:
raw_corpus_path = '../nutella.csv'
raw_corpus =  DataCleaner.get_raw_corpus_from_path(raw_corpus_path)

# First line of the corpus
example_raw_corpus = raw_corpus[0]
example_raw_corpus

'0|On est samedi soir il est 22h je suis en pyjama dans mon lit devant la télé en train de manger des brioches/Nutella et je me les cailles...|à jakarta le g de nutella il est au même prix que le g de md|Boulettes de fromage, boulettes de viande fourrées au fromage et muffin Nutella... L\'obésité morbide me guette 🍟|Um pote de Nutella...||@Blandine_Laff @MmmYummyFood juste du Nutella c\'est parfait 🙌🏼|🍫🍫 L\'arnaque du siècle ca....😡😡 #nutella #gouter #chocolat #chaud #nesquick #tv #montpellier #sud #oklm #bready #pourri #degueulasse #faux #kinder #bueno|"La Gentilhommière vous invite à découvrir la nouvelle formule Déjeuner de Février\n'

This example reveals two issues in the raw corpus:

    - There is internet-related content, such as emojis, that LDA cannot process as is
    - French is a highly inflected language, i.e. words with the same meaning can appear in different forms (singular/plural, conjugation)

### 1.1.1 - Dealing with internet related content

Some internet-related items don't convey any meaning about the content, such as:

    - HTML tags (e.g. <em> some text </em>)
    - URLs
    - references to other users (e.g. @username)
    
They therefore have to be removed from the corpus.
    

Some other items, such as emojis and hashtags, convey a meaning related to the topic of a post. So we want to keep them, but they have to be transformed.

I chose to:
    
    - Remove the # in hashtags but keep the content, for instance #chocolat -> chocolat
    - Replace emojis by their name, for instance 🍫 -> chocolate bar
    
Note that emoji names are in English. This won't create much trouble in the later part of the algorithm because it is 'language agnostic'.

As an example, we can see the results on the first line of the corpus:

In [3]:
# Join a list of words into a single line (each word is preceded by a space)
def print_on_line(list_words):
    return ''.join(' ' + word for word in list_words)

In [4]:
example_after_internet_cleaning = DataCleaner('french').get_raw_textual_data_in_document(example_raw_corpus)
print_on_line(example_after_internet_cleaning)

' 0 On est samedi soir il est 22h je suis en pyjama dans mon lit devant la télé en train de manger des brioches/Nutella et je me les cailles... à jakarta le g de nutella il est au même prix que le g de md Boulettes de fromage, boulettes de viande fourrées au fromage et muffin Nutella... L\'obésité morbide me guette  french fries  Um pote de Nutella...      juste du Nutella c\'est parfait  person raising both hands in celebration  emoji modifier fitzpatrick type-3   chocolate bar  chocolate bar  L\'arnaque du siècle ca.... pouting face  pouting face   nutella  gouter  chocolat  chaud  nesquick  tv  montpellier  sud  oklm  bready  pourri  degueulasse  faux  kinder  bueno "La Gentilhommière vous invite à découvrir la nouvelle formule Déjeuner de Février'
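
The internet-related cleaning described above could be sketched as follows. This is not the code of `DataCleaner`, whose implementation is not shown here: the function names and regular expressions are assumptions, and the emoji handling relies on the standard `unicodedata` module, which only approximates what a dedicated emoji library would do (for instance, skin-tone modifiers are left untouched).

```python
import re
import unicodedata

# Assumed patterns for the items that carry no topical meaning.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#(\w+)")


def describe_symbol(char):
    """Replace a pictographic symbol by its lowercase Unicode name."""
    # Category 'So' (Symbol, other) covers most emojis, but also a few
    # non-emoji symbols such as the degree sign.
    if unicodedata.category(char) == "So":
        name = unicodedata.name(char, "")
        return f" {name.lower()} " if name else " "
    return char


def clean_internet_content(text):
    """Remove tags, URLs and mentions; keep hashtag content and emoji names."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    text = HASHTAG.sub(r"\1", text)
    text = "".join(describe_symbol(char) for char in text)
    return " ".join(text.split())


print(clean_internet_content("<em>Miam</em> @Blandine_Laff du #nutella 🍫"))
```

URLs are removed before hashtags are processed so that a `#` fragment inside a URL is not mistaken for a hashtag.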

### 1.1.2 - Dealing with the language itself

Some types of words don't convey any topic-related meaning, such as:

    - punctuation
    - pronouns
    - adverbs
    - prepositions ...
    
They therefore need to be removed.
   

The remaining words, such as verbs and nouns, need to be standardized. This standardization process is called 'lemmatization'. It consists of finding the form of the word as it appears in a dictionary. For instance, the lemma of a French verb is its infinitive (e.g. veux -> vouloir).

We can see the results of this cleaning on the first line of the corpus:

In [5]:
example_after_language_cleaning = DataCleaner('french').get_important_lemmas_in_textual_data(example_after_internet_cleaning)
print_on_line(example_after_language_cleaning)

' on samedi soir 22h pyjama lit télé train manger brioche nutella caille jakarta g nutella même prix g md boulette fromage boulette viande fourrer fromage muffin nutella obésité morbide guetter french fries um poter nutella juste nutella parfait person raising both hands in celebration emoji modifier fitzpatrick type-3 chocolate bar chocolate bar arnaque siècle ca pouting face pouting face nutella gouter chocolat chaud nesquick tv montpellier sud oklm bready pourrir degueulasse faux kinder bueno gentilhommière inviter découvrir nouveau formule déjeuner février'
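
As a rough illustration of this step, the sketch below filters and normalizes a list of words. It deliberately swaps the part-of-speech based approach for two tiny hand-written lookup tables (`STOP_WORDS` and `LEMMAS`, both hypothetical), so it only shows the shape of the transformation, not how `DataCleaner` actually works.

```python
import string

# Hypothetical, tiny lookup tables for illustration only; a real pipeline
# would obtain part-of-speech tags and lemmas from a statistical tagger.
STOP_WORDS = {"je", "il", "est", "de", "des", "du", "au", "le", "la", "les", "en", "et", "me"}
LEMMAS = {
    "veux": "vouloir",
    "brioches": "brioche",
    "boulettes": "boulette",
    "fourrées": "fourrer",
}


def get_important_lemmas(words):
    """Drop punctuation and stop words, then map each word to its lemma."""
    lemmas = []
    for word in words:
        token = word.lower().strip(string.punctuation)
        if not token or token in STOP_WORDS:
            continue
        # Unknown words are kept unchanged.
        lemmas.append(LEMMAS.get(token, token))
    return lemmas


print(get_important_lemmas(["Boulettes", "de", "viande", "fourrées", "au", "fromage", "..."]))
```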

### 1.1.3 - Cleaning the whole corpus

We first split the corpus into so-called 'documents' (each one is introduced by a document number followed by a | tag). Then we can perform both steps described above to obtain a clean text corpus, split into documents.

In [6]:
data_path = '../nutella.csv'
clean_text_corpus = DataCleaner('french').get_clean_documents_from_corpus_path(data_path)

Since this process can take a long time, you can save the clean corpus to a file and load it back later:

In [2]:
import pickle

In [7]:
# Save
with open('../clean_text_corpus.pkl', 'wb') as save_file:
    pickle.dump(clean_text_corpus, save_file)

In [3]:
# Load
with open('../clean_text_corpus.pkl', 'rb') as load_file:
    clean_text_corpus = pickle.load(load_file)

### 1.1.4 - A note on implementation

The language normalization step described above is not perfect, since it is based on a statistical model. The better the model, the better the normalization. Here, the models are based on fairly old algorithms, developed by the TreeTagger team in the late 90s. 
More recent packages such as nltk could not be used because they only perform lemmatization for the English language.
One way to improve this language normalization step could be to create my own, more accurate models for the French language.

## 1.2 - Representing the clean text corpus numerically

In [4]:
from utils.numerical_corpus import NumericalCorpus

The text corpus needs to be represented numerically before it can be processed by the LDA algorithm. To do so, we use a vector space model called the 'Bag of Words' (bow) model.


### 1.2.1 - Indexing words of the corpus

First, we index every word that appears in the corpus, i.e. an id is given to each word.
For instance, for the first three tweets of the corpus, the word 'nutella' has the id 5.

In [20]:
# First three lines as clean text
first_tree_tweets = example_after_language_cleaning[:32]
print_on_line(first_tree_tweets)

' on samedi soir 22h pyjama lit télé train manger brioche nutella caille jakarta g nutella même prix g md boulette fromage boulette viande fourrer fromage muffin nutella obésité morbide guetter french fries'

In [22]:
# Mapping
mapping = NumericalCorpus.create_mapping_words_to_id_from_text_corpus([first_tree_tweets])
mapping.token2id

{'22h': 24,
 'boulette': 12,
 'brioche': 16,
 'caille': 11,
 'fourrer': 0,
 'french': 7,
 'fries': 8,
 'fromage': 18,
 'g': 19,
 'guetter': 9,
 'jakarta': 22,
 'lit': 25,
 'manger': 10,
 'md': 20,
 'morbide': 26,
 'muffin': 21,
 'même': 2,
 'nutella': 5,
 'obésité': 1,
 'on': 13,
 'prix': 17,
 'pyjama': 4,
 'samedi': 6,
 'soir': 3,
 'train': 23,
 'télé': 15,
 'viande': 14}

### 1.2.2 - Representing documents by vectors

A document is represented by a vector containing at position k+1 the number of times the word indexed by k appears. For instance, in the first three tweets, the word nutella (indexed by 5) appears three times. Thus the 6th vector coordinate is 3.
Such a vector is called a document vector. In the output below, it is stored as a list of (word id, count) pairs.

In [27]:
_, document_vector = NumericalCorpus.build_from_text_corpus([first_tree_tweets])
document_vector

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 3),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 2),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 2),
  (19, 2),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1)]]

We represent each document by a vector and arrange them in a matrix, called a document-term matrix. 
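
The indexing and counting steps can be sketched in plain Python as follows. The helper names are assumptions, not the API of `NumericalCorpus`, and ids are assigned in order of first appearance, so they will not match the ids shown in the outputs above.

```python
from collections import Counter


def create_word_to_id(documents):
    """Assign an integer id to every distinct word, in order of first appearance."""
    word_to_id = {}
    for document in documents:
        for word in document:
            word_to_id.setdefault(word, len(word_to_id))
    return word_to_id


def to_bag_of_words(document, word_to_id):
    """Return the sparse vector of a document as sorted (word_id, count) pairs."""
    counts = Counter(word_to_id[word] for word in document)
    return sorted(counts.items())


# Two toy documents made of already cleaned lemmas.
documents = [["nutella", "brioche", "nutella"], ["fromage", "nutella"]]
word_to_id = create_word_to_id(documents)
toy_document_term_matrix = [to_bag_of_words(doc, word_to_id) for doc in documents]
print(word_to_id)
print(toy_document_term_matrix)
```

Storing only the non-zero counts keeps the representation compact, since most words of the vocabulary do not appear in any given document.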

### 1.2.3 - Representing the whole corpus

We now represent the whole clean text corpus numerically. The result can also be saved to and loaded from a file.

In [5]:
dictionary_word2id, document_term_matrix = NumericalCorpus.build_from_text_corpus(clean_text_corpus)

In [11]:
# Save
with open('../numerical_corpus.pkl', 'wb') as save_file:
    pickle.dump(dictionary_word2id, save_file)
    pickle.dump(document_term_matrix, save_file)

In [12]:
# Load
with open('../numerical_corpus.pkl', 'rb') as load_file:
    dictionary_word2id = pickle.load(load_file)
    document_term_matrix = pickle.load(load_file)

### 1.2.4 - Other possible representations

There are other possible numerical representations, for instance tf-idf. Bag of words has been chosen for its simplicity, but other numerical representations could be tried in a second phase of the project.
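
To give an idea of what tf-idf would change, the sketch below re-weights bag-of-words vectors so that words appearing in many documents count less. It uses one common variant (raw count times `log(N / document frequency)`); other variants exist, and the function name is an assumption.

```python
import math


def to_tf_idf(document_term_matrix):
    """Re-weight sparse (word_id, count) vectors by inverse document frequency."""
    n_documents = len(document_term_matrix)

    # Number of documents in which each word appears at least once.
    document_frequency = {}
    for vector in document_term_matrix:
        for word_id, _ in vector:
            document_frequency[word_id] = document_frequency.get(word_id, 0) + 1

    weighted = []
    for vector in document_term_matrix:
        weighted.append([
            (word_id, count * math.log(n_documents / document_frequency[word_id]))
            for word_id, count in vector
        ])
    return weighted


toy_bow = [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
toy_tf_idf = to_tf_idf(toy_bow)
print(toy_tf_idf)
```

In this toy example, word 0 appears in every document and therefore receives a weight of zero, which is the behaviour one would want for a word like 'nutella' in this corpus.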

## 1.3 - Applying the LDA algorithm on the numerical corpus

In [6]:
from utils.topic_modeler import TopicModeler

In [7]:
model = TopicModeler((dictionary_word2id, document_term_matrix), 10)

In [8]:
model.train(50)

In [9]:
model.get_most_important_topics_and_keywords(10)

[(0,
  '0.008*"nutella" + 0.007*"faire" + 0.004*"dire" + 0.004*"prendre" + 0.004*"jour" + 0.004*"petit" + 0.003*"manger" + 0.003*"pouvoir" + 0.003*"j’ai" + 0.003*"voir"'),
 (1,
  '0.011*"*" + 0.004*"faire" + 0.002*"savoir" + 0.002*"$" + 0.002*"devoir" + 0.002*"mettre" + 0.002*"pouvoir" + 0.002*"il" + 0.002*"dire" + 0.002*"nutella"'),
 (2,
  '0.042*"nutella" + 0.013*"face" + 0.012*"faire" + 0.008*"bon" + 0.008*">" + 0.007*"aller" + 0.007*"manger" + 0.007*"with" + 0.006*"moi" + 0.006*"petit"'),
 (3,
  '0.011*"faire" + 0.010*"nutella" + 0.006*"*" + 0.006*"g" + 0.006*"moi-" + 0.004*"aller" + 0.004*"face" + 0.004*"bon" + 0.004*"matt-" + 0.004*"ajouter"'),
 (4,
  '0.015*"*" + 0.012*"aimer" + 0.010*"recette" + 0.010*"faire" + 0.008*"ricotta" + 0.008*"juin" + 0.008*"face" + 0.008*"écrire" + 0.007*"pâte" + 0.006*"merci"'),
 (5,
  '0.012*"nutella" + 0.007*"faire" + 0.006*"poster" + 0.006*"petit" + 0.005*"g" + 0.005*"aller" + 0.005*"régime" + 0.004*"loudly" + 0.004*"crying" + 0.004*"face"'),
 (6,

In [19]:
import pyLDAvis.gensim

In [21]:
topic_vis_data = pyLDAvis.gensim.prepare(model.model, document_term_matrix, dictionary_word2id)
pyLDAvis.display(topic_vis_data)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  topic_term_dists = topic_term_dists.ix[topic_order]


In [13]:
model_2 = TopicModeler((dictionary_word2id, document_term_matrix), 5)

In [15]:
model_2.train(50)

In [17]:
model_2.get_most_important_topics_and_keywords(5)

[(0,
  '0.016*"nutella" + 0.012*"faire" + 0.007*"g" + 0.006*"petit" + 0.006*"aimer"'),
 (1,
  '0.044*"nutella" + 0.025*"face" + 0.016*"with" + 0.011*"smiling" + 0.010*"faire"'),
 (2, '0.015*"moi" + 0.015*"castiel" + 0.014*">" + 0.010*"^^" + 0.008*"face"'),
 (3,
  '0.015*"nutella" + 0.013*"billet" + 0.008*"ajouter" + 0.008*"cœur" + 0.008*"coup"'),
 (4,
  '0.055*"nutella" + 0.021*"face" + 0.013*"with" + 0.010*"faire" + 0.008*"smiling"')]

In [22]:
topic_vis_data_2 = pyLDAvis.gensim.prepare(model_2.model, document_term_matrix, dictionary_word2id)
pyLDAvis.display(topic_vis_data_2)


In [24]:
model_3 = TopicModeler((dictionary_word2id, document_term_matrix), 20)

In [25]:
model_3.train(100)

In [26]:
model_3.get_most_important_topics_and_keywords(20)

[(0,
  '0.012*"g" + 0.009*"faire" + 0.008*"cuillerée" + 0.006*"beurre" + 0.005*"plat" + 0.005*"préparation" + 0.005*"irlandais" + 0.005*"soupe" + 0.005*"nutella" + 0.004*"pour" + 0.004*"four" + 0.004*"ingrédient" + 0.004*"café" + 0.004*"ml" + 0.004*"oignon" + 0.004*"ajouter" + 0.004*"j\'" + 0.004*"petit" + 0.004*"un" + 0.004*"gros"'),
 (1,
  '0.027*"nutella" + 0.018*"*" + 0.010*"face" + 0.009*"emmerder" + 0.009*"with" + 0.006*"smiling" + 0.005*"eyes" + 0.005*"bon" + 0.005*"petit" + 0.004*"heart-shaped" + 0.003*"chocolat" + 0.003*"préférer" + 0.003*"red" + 0.003*"$" + 0.003*"food" + 0.003*"down-pointing" + 0.003*"triangle" + 0.002*"small" + 0.002*"aimer" + 0.002*"aller"'),
 (2,
  '0.049*"loudly" + 0.048*"crying" + 0.010*"face" + 0.006*"€" + 0.005*"illustration" + 0.004*"info" + 0.003*"actionfrance" + 0.003*"photoshop" + 0.003*"deco" + 0.003*"actu" + 0.002*"^^" + 0.002*"messagemignion" + 0.002*"an" + 0.002*"logo" + 0.002*"prison" + 0.002*"infographie" + 0.002*"cartedevisite" + 0.002*"aff

In [27]:
topic_vis_data_3 = pyLDAvis.gensim.prepare(model_3.model, document_term_matrix, dictionary_word2id)
pyLDAvis.display(topic_vis_data_3)


### 1.3.1 - Explanation of the algorithm

Let A be the document-term matrix from the previous step. Let N denote the number of documents in the corpus and M the number of distinct words/lemmas in the corpus.
Matrix A has as many rows as there are documents in the corpus (N) and as many columns as there are distinct words in the corpus (M).

The goal for the LDA algorithm is to uncover:
    
    - The distribution of topics among the documents
    - The distribution of terms among topics
    
Let B denote the document-topic matrix and K the number of topics in the corpus. Each row i of matrix B represents the distribution of topics in document i (for instance 10% of topic 1, 40% of topic 3, 50% of topic 5). B thus has N rows and K columns.

Let C denote the topic-term matrix. Each row i of matrix C represents the distribution of terms (= keywords) for topic i. C thus has K rows and M columns.

In a sense, B and C are an approximate decomposition of A (A ≈ B * C, once the rows of A are normalized into word frequencies).
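
The shapes involved can be checked on a toy example. The numbers below are made up for illustration (N = 2 documents, K = 2 topics, M = 3 terms); the product of B and C gives, for each document, the word distribution implied by its mix of topics.

```python
# Toy document-topic matrix B (N x K): each row sums to 1.
B = [[0.9, 0.1],   # document 0 is mostly about topic 0
     [0.2, 0.8]]   # document 1 is mostly about topic 1

# Toy topic-term matrix C (K x M): each row sums to 1.
C = [[0.7, 0.2, 0.1],   # topic 0 favours term 0
     [0.1, 0.3, 0.6]]   # topic 1 favours term 2


def multiply(left, right):
    """Plain matrix product of two matrices given as lists of rows."""
    return [
        [sum(l * r for l, r in zip(row, column)) for column in zip(*right)]
        for row in left
    ]


# N x M matrix: expected word distribution of each document.
expected_term_distribution = multiply(B, C)
for row in expected_term_distribution:
    print([round(value, 2) for value in row])
```

Each row of the product is again a probability distribution, which is why it is compared with the normalized rows of A rather than with the raw counts.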

The algorithm iterates through each word w of each document d and tries to replace the current topic assignment of that word with a new one. A new topic t is assigned to word w with a probability P, which is the product of two probabilities p1 and p2.

For every topic t, the two probabilities are calculated as follows:

    - p1 = p(topic t | document d), the proportion of words in document d that are currently assigned to topic t
    - p2 = p(word w | topic t), the proportion of assignments to topic t, over all documents, that come from word w

The current topic assignment of the word is then replaced by a new topic drawn with a probability proportional to the product of p1 and p2. In this step, the model assumes that all existing word-topic assignments except that of the current word are correct. The product is essentially the probability that topic t generated word w, so it makes sense to resample the current word's topic according to it.

After a number of iterations, a steady state is reached where the document-topic and topic-term distributions are fairly good. This is the convergence point of LDA.
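
The sampling procedure described above can be sketched in plain Python as below. This is a minimal illustration, not the implementation used by `TopicModeler`: the function name, the smoothing constants `alpha` and `beta` (added so that no probability is exactly zero) and the toy documents are all assumptions.

```python
import random


def gibbs_lda(documents, n_topics, n_words, n_iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Resample topic assignments for documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in documents]        # words of document d assigned to topic t
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word w is assigned to topic t
    topic_total = [0] * n_topics                           # total number of words assigned to topic t
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, document in enumerate(documents):
        topics = [rng.randrange(n_topics) for _ in document]
        assignments.append(topics)
        for w, t in zip(document, topics):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(n_iterations):
        for d, document in enumerate(documents):
            for i, w in enumerate(document):
                # Remove the current assignment; all the others are assumed correct.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1

                # Smoothed p1 * p2 for each topic (the denominator of p1 is the
                # same for every topic, so it is left out).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * n_words)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    return doc_topic, topic_word


toy_documents = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0, 2]]
doc_topic, topic_word = gibbs_lda(toy_documents, n_topics=2, n_words=4)
print(doc_topic)
print(topic_word)
```

Normalizing the rows of `doc_topic` and `topic_word` gives estimates of the matrices B and C introduced earlier.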