# Modeling frequent topics occurring in social posts concerning Nutella

### Camille Strasser

The goal of this analysis is to determine the most frequent topics appearing in French-language social media posts about the brand 'Nutella'. This analysis shows which concepts French-speaking consumers associate with Nutella.

The mathematical approach to topic modeling is described in the first section of the analysis. For the business conclusions of the study, you can jump to the second section.

# 1 - Technical approach to topic modeling

We want to discover which topics and keywords are the most frequent in a raw text corpus. One of the most widely used algorithms for topic modeling, Latent Dirichlet Allocation (LDA), has been chosen for this quick analysis. We will go step by step through the description of the algorithm:
    
    - Cleaning the text corpus
    - Representing the text corpus numerically
    - Performing LDA on the numerical representation

## 1.1 - Cleaning the raw text corpus

For this first step of the algorithm, I created a class performing the cleaning of a raw text corpus:

In [1]:
from utils.data_cleaner import DataCleaner

Let's take a look at the first line of the raw text corpus:

In [2]:
raw_corpus_path = '../nutella.csv'
raw_corpus = DataCleaner.get_raw_corpus_from_path(raw_corpus_path)

# First line of the corpus
example_raw_corpus = raw_corpus[0]
example_raw_corpus

'0|On est samedi soir il est 22h je suis en pyjama dans mon lit devant la télé en train de manger des brioches/Nutella et je me les cailles...|à jakarta le g de nutella il est au même prix que le g de md|Boulettes de fromage, boulettes de viande fourrées au fromage et muffin Nutella... L\'obésité morbide me guette 🍟|Um pote de Nutella...||@Blandine_Laff @MmmYummyFood juste du Nutella c\'est parfait 🙌🏼|🍫🍫 L\'arnaque du siècle ca....😡😡 #nutella #gouter #chocolat #chaud #nesquick #tv #montpellier #sud #oklm #bready #pourri #degueulasse #faux #kinder #bueno|"La Gentilhommière vous invite à découvrir la nouvelle formule Déjeuner de Février\n'

We can observe two main issues in the raw corpus from this example:

    - There is internet-related content, such as emojis, that we won't be able to process with LDA as is
    - French is a highly inflected language, i.e. words with the same meaning can appear in different forms (singular/plural, conjugation)

### 1.1.1 - Dealing with internet related content

Some internet-related items don't convey any meaning about the content, such as:

    - HTML tags (e.g. <em> some text </em>)
    - URLs
    - tags referring to other users (e.g. @username)

Thus they have to be removed from the corpus.
    

Some other items, such as emojis and hashtags, convey a meaning related to the topic of a post. So we want to keep them, but they have to be transformed.

I chose to:
    
    - Remove the # in hashtags but keep their content (#chocolat -> chocolat)
    - Replace emojis by their name (🍫 -> chocolate bar)

Note that emoji names are in English. This does not cause much trouble in the later steps because the LDA algorithm itself is language agnostic.
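The transformations listed above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code of `DataCleaner`: the regular expressions and the helper names (`emoji_to_name`, `clean_internet_content`) are hypothetical, and emojis are approximated as characters in the Unicode symbol categories.

```python
import re
import unicodedata

# Hypothetical patterns for the items that carry no topic-related meaning
HTML_TAG = re.compile(r'<[^>]+>')
URL = re.compile(r'https?://\S+|www\.\S+')
MENTION = re.compile(r'@\w+')
HASHTAG = re.compile(r'#(\w+)')


def emoji_to_name(char):
    """Replace a symbol character by its lowercased Unicode name."""
    category = unicodedata.category(char)
    # Assumption: emojis are 'So' symbols, skin tone modifiers are 'Sk'
    # symbols outside the Basic Multilingual Plane.
    is_symbol = category == 'So' or (category == 'Sk' and ord(char) > 0xFFFF)
    if not is_symbol:
        return char
    # Symbols without a known name are dropped
    return ' ' + unicodedata.name(char, '').lower() + ' '


def clean_internet_content(text):
    text = HTML_TAG.sub(' ', text)      # remove HTML tags
    text = URL.sub(' ', text)           # remove URLs
    text = MENTION.sub(' ', text)       # remove @username tags
    text = HASHTAG.sub(r'\1', text)     # #chocolat -> chocolat
    text = ''.join(emoji_to_name(char) for char in text)
    return ' '.join(text.split())       # normalize whitespace


print(clean_internet_content('<em>miam</em> @bob #chocolat 🍫 http://t.co/x'))
```

The Unicode character names used here match the emoji names visible in the cleaned output of the notebook (for instance 'chocolate bar' and 'pouting face').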

As an example, we can see the results on the first line of the corpus:

In [12]:
# Function to print a list on the same line
def print_on_line(list_words):
    line = ''
    for word in list_words:
        line = line + ' ' + word
    return line

In [13]:
# Path for tree tagger module
tree_tagger_directory = '../tree_tagger/'
# Object to clean French data
cleaner = DataCleaner('french', tree_tagger_directory)

In [14]:
example_after_internet_cleaning = cleaner.get_raw_textual_data_in_document(example_raw_corpus)
print_on_line(example_after_internet_cleaning)

' 0 On est samedi soir il est 22h je suis en pyjama dans mon lit devant la télé en train de manger des brioches/Nutella et je me les cailles... à jakarta le g de nutella il est au même prix que le g de md Boulettes de fromage, boulettes de viande fourrées au fromage et muffin Nutella... L\'obésité morbide me guette  french fries  Um pote de Nutella...      juste du Nutella c\'est parfait  person raising both hands in celebration  emoji modifier fitzpatrick type-3   chocolate bar  chocolate bar  L\'arnaque du siècle ca.... pouting face  pouting face   nutella  gouter  chocolat  chaud  nesquick  tv  montpellier  sud  oklm  bready  pourri  degueulasse  faux  kinder  bueno "La Gentilhommière vous invite à découvrir la nouvelle formule Déjeuner de Février'

### 1.1.2 - Dealing with the language itself

Some types of words don't convey any topic-related meaning, such as:

    - punctuation
    - pronouns
    - adverbs
    - prepositions ...

They thus need to be removed.
   

Remaining words, such as verbs and nouns, need to be standardized. This process of standardization is called 'lemmatization'. It consists in finding the form of the word that you would find in a dictionary. For instance, the lemma of a French verb is its infinitive (e.g. veux -> vouloir).
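To make the two operations concrete (filtering by word type, then lemmatizing), here is a toy sketch. It does not use TreeTagger: the tiny lexicon, the part-of-speech labels and the function name `important_lemmas` are all made up for illustration, whereas the real cleaner relies on a statistical tagger.

```python
# Hypothetical toy lexicon: token -> (part of speech, lemma).
# A real tagger predicts these from context instead of looking them up.
TOY_LEXICON = {
    'je': ('PRO', 'je'),
    'veux': ('VER', 'vouloir'),
    'des': ('DET', 'du'),
    'au': ('PRP', 'au'),
    'brioches': ('NOM', 'brioche'),
    'fourrées': ('VER', 'fourrer'),
}

# Word types assumed to carry topic-related meaning
KEPT_POS = {'NOM', 'VER', 'ADJ'}


def important_lemmas(tokens, lexicon=TOY_LEXICON, kept_pos=KEPT_POS):
    lemmas = []
    for token in tokens:
        token = token.lower()
        # Assumption: unknown tokens are treated as nouns and kept as is
        pos, lemma = lexicon.get(token, ('NOM', token))
        if pos in kept_pos:
            lemmas.append(lemma)
    return lemmas


print(important_lemmas(['Je', 'veux', 'des', 'brioches', 'fourrées', 'au', 'Nutella']))
```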

We can see the results of this cleaning on the first line of the corpus:

In [15]:
example_after_language_cleaning = cleaner.get_important_lemmas_in_textual_data(example_after_internet_cleaning)
print_on_line(example_after_language_cleaning)

' on samedi soir 22h pyjama lit télé train manger brioche nutella caille jakarta g nutella même prix g md boulette fromage boulette viande fourrer fromage muffin nutella obésité morbide guetter french fries um poter nutella juste nutella parfait person raising both hands in celebration emoji modifier fitzpatrick type-3 chocolate bar chocolate bar arnaque siècle ca pouting face pouting face nutella gouter chocolat chaud nesquick tv montpellier sud oklm bready pourrir degueulasse faux kinder bueno gentilhommière inviter découvrir nouveau formule déjeuner février'

### 1.1.3 - Cleaning the whole corpus

We first split the corpus into so-called 'documents' (each one marked by a document number followed by a | tag). Then we can perform both steps described above to obtain a clean text corpus, split into documents.

In [11]:
data_path = '../nutella.csv'
clean_text_corpus = cleaner.get_clean_documents_from_corpus_path(data_path)

Since this process can be slow, you can save the clean corpus to a file and load it back later:

In [2]:
import pickle

In [17]:
# Save
with open('../clean_text_corpus.pkl', 'wb') as save_file:
    pickle.dump(clean_text_corpus, save_file)

In [18]:
# Load
with open('../clean_text_corpus.pkl', 'rb') as load_file:
    clean_text_corpus = pickle.load(load_file)

### 1.1.4 - A note on implementation

The language normalization steps described above are not perfect, since they rely on statistical models. The better the model, the better the normalization. Here, the models rely on fairly old algorithms, developed by the TreeTagger team in the late 90s.
More recent packages such as nltk couldn't be used because their lemmatizer only supports the English language.
One way to improve this language normalization step could be to create my own, more accurate models for the French language.

## 1.2 - Representing the clean text corpus numerically

The text corpus needs to be represented numerically before it can be processed by the LDA algorithm. To do so, we use a vector space model called the 'Bag of Words' (bow) model.


I created a class to transform the text corpus into a numerical corpus:

In [19]:
from utils.numerical_corpus import NumericalCorpus

### 1.2.1 - Indexing words of the corpus

First, we index every word that appears in the corpus, i.e. each word is given an id.
For instance, for the first three tweets of the corpus, the word 'nutella' has the id 5.

In [20]:
# First three lines as clean text
first_tree_tweets = example_after_language_cleaning[:32]
print_on_line(first_tree_tweets)

' on samedi soir 22h pyjama lit télé train manger brioche nutella caille jakarta g nutella même prix g md boulette fromage boulette viande fourrer fromage muffin nutella obésité morbide guetter french fries'

In [21]:
# Mapping
mapping = NumericalCorpus.create_mapping_words_to_id_from_text_corpus([first_tree_tweets])
mapping.token2id

{'22h': 9,
 'boulette': 14,
 'brioche': 17,
 'caille': 22,
 'fourrer': 4,
 'french': 10,
 'fries': 0,
 'fromage': 23,
 'g': 26,
 'guetter': 16,
 'jakarta': 18,
 'lit': 19,
 'manger': 15,
 'md': 3,
 'morbide': 25,
 'muffin': 11,
 'même': 12,
 'nutella': 5,
 'obésité': 24,
 'on': 8,
 'prix': 6,
 'pyjama': 21,
 'samedi': 13,
 'soir': 1,
 'train': 20,
 'télé': 2,
 'viande': 7}

### 1.2.2 - Representing documents by vectors

A document is represented by a vector containing at position k+1 the number of times the word indexed by k appears. For instance, in the first three tweets, the word nutella (indexed by 5) appears three times. Thus the 6th vector coordinate is 3.
Such a vector is called a document vector. In the output below, it is stored as a list of (word id, count) pairs.

In [22]:
_, document_vector = NumericalCorpus.build_from_text_corpus([first_tree_tweets])
document_vector

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 3),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 2),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 2),
  (24, 1),
  (25, 1),
  (26, 2)]]

We represent each document by a vector and arrange them in a matrix, called a document-term matrix. 
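The indexing and counting steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of `NumericalCorpus`: the function names `build_mapping` and `to_bow` and the toy documents are assumptions, and ids are handed out in order of first appearance, which differs from the ids shown above.

```python
from collections import Counter


def build_mapping(documents):
    """Give an id to every distinct word, in order of first appearance."""
    token2id = {}
    for document in documents:
        for token in document:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(document, token2id):
    """Represent a document as a sorted list of (word id, count) pairs."""
    counts = Counter(token2id[token] for token in document if token in token2id)
    return sorted(counts.items())


# Toy corpus of two already cleaned documents
toy_documents = [['nutella', 'brioche', 'nutella'], ['fromage', 'nutella']]
toy_mapping = build_mapping(toy_documents)
toy_matrix = [to_bow(document, toy_mapping) for document in toy_documents]

print(toy_mapping)
print(toy_matrix)
```

Storing only the non-zero counts keeps the document-term matrix small, since each post uses a tiny fraction of the vocabulary.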

### 1.2.3 - Representing the whole corpus

We represent the whole clean text corpus numerically. The result can also be saved to and loaded from a file:

In [23]:
dictionary_word2id, document_term_matrix = NumericalCorpus.build_from_text_corpus(clean_text_corpus)

In [24]:
# Save
with open('../numerical_corpus.pkl', 'wb') as save_file:
    pickle.dump(dictionary_word2id, save_file)
    pickle.dump(document_term_matrix, save_file)

In [7]:
# Load
with open('../numerical_corpus.pkl', 'rb') as load_file:
    dictionary_word2id = pickle.load(load_file)
    document_term_matrix = pickle.load(load_file)

### 1.2.4 - Other possible representations

There are other possible numerical representations, for instance tf-idf. Bag of words has been chosen for its simplicity, but in a later phase of the project, other numerical representations could be chosen.
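As an illustration of such an alternative, the sketch below reweights a bag-of-words corpus with one common tf-idf variant (raw count multiplied by log(N / document frequency)). The function name `tfidf` and the toy corpus are assumptions, and libraries may apply different weighting or normalization.

```python
import math


def tfidf(bow_corpus):
    """Reweight (word id, count) pairs by inverse document frequency."""
    num_documents = len(bow_corpus)
    # Number of documents in which each word id appears
    document_frequency = {}
    for document in bow_corpus:
        for word_id, _ in document:
            document_frequency[word_id] = document_frequency.get(word_id, 0) + 1
    return [
        [(word_id, count * math.log(num_documents / document_frequency[word_id]))
         for word_id, count in document]
        for document in bow_corpus
    ]


toy_bow_corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
toy_weights = tfidf(toy_bow_corpus)
print(toy_weights)
```

With this weighting, a word that appears in every document (id 0 in the toy corpus, which would be 'nutella' here) gets a weight of zero.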

## 1.3 - Applying the LDA algorithm on the numerical corpus

### 1.3.1 - Explanation of the algorithm

Let A be the document-term matrix from the previous step. Let N denote the number of documents in the corpus and M the number of distinct words/lemmas in the corpus.
Matrix A has as many rows as there are documents in the corpus (N) and as many columns as there are distinct words in the corpus (M).

The goal for the LDA algorithm is to uncover from matrix A:
    
    - The distribution of topics among documents
    - The distribution of terms among topics
    
Let B denote the document-topic matrix and K the number of topics in the corpus. Each row i of matrix B represents the distribution of topics in document i (for instance 10% of topic 1, 40% of topic 3, 50% of topic 5). B thus has N rows and K columns.

Let C denote the topic-term matrix. Each row i of matrix C represents the distribution of terms (= keywords) for topic i. C thus has K rows and M columns.


The algorithm iterates through each word w of each document d and tries to replace the topic currently assigned to that word (these assignments determine matrices B and C). A new topic t is assigned to word w with a probability P proportional to the product of two probabilities P1 and P2:

    P1 = P(topic t | document d) = the proportion of words in document d that are currently assigned to topic t = element at row d and column t in matrix B
    
    P2 = P(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from word w = element at row t and column w in matrix C

In this step, the model assumes that all the existing word-topic assignments except the one for the current word are correct. The product of P1 and P2 is essentially the probability that topic t generated word w in document d, so it makes sense to resample the current word's topic with this probability.

Matrices B and C are updated if a new topic is assigned to the current word.

After a number of iterations, a steady state is reached where the document-topic and topic-term distributions no longer change much. This is the convergence point of LDA.

In a testing phase, the matrix B (document-topic matrix) is used to describe the topics of a specific document. The matrix C (topic-keyword matrix) is used to describe the most common topics and keywords of the whole corpus. We are mostly interested in the latter.
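The sampling procedure described in this section can be sketched as follows. It is a minimal illustration of one way to fit LDA (collapsed Gibbs sampling), not necessarily what `TopicModeler` runs internally. The function name `gibbs_lda` and the smoothing parameters `alpha` and `beta` are assumptions added so that no probability is ever exactly zero. Documents are given as lists of word ids.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, passes=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # counts behind B
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # counts behind C
    topic_total = [0] * num_topics
    assignments = []

    def update(d, w, t, delta):
        doc_topic[d][t] += delta
        topic_word[t][w] += delta
        topic_total[t] += delta

    # Start from a random topic for every word occurrence
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, t in zip(doc, assignments[d]):
            update(d, w, t, +1)

    for _ in range(passes):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, all others are assumed correct
                update(d, w, assignments[d][i], -1)
                # Unnormalized P1 * P2 for every candidate topic
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                new_topic = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new_topic
                update(d, w, new_topic, +1)

    # Turn the counts into the distributions B and C
    B = [[(count + alpha) / (len(doc) + num_topics * alpha) for count in row]
         for row, doc in zip(doc_topic, docs)]
    C = [[(count + beta) / (topic_total[t] + vocab_size * beta) for count in topic_word[t]]
         for t in range(num_topics)]
    return B, C


toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
toy_B, toy_C = gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
```

Each row of `toy_B` is a distribution over the 2 topics for one document, and each row of `toy_C` is a distribution over the 4 word ids for one topic.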

### 1.3.2 - A note on algorithm parameters

There are 3 important parameters in this algorithm:
    
    - the number K of topics
    - the number of passes n
    - the number of keywords per topic

To pick the best parameters, we should introduce a validation set. For instance, we could split the given dataset into an 80% training set and a 20% validation set.
Then we would train a model for each set of parameters (with a grid search, for instance) and evaluate it on the validation set. One possible evaluation metric is perplexity, which should be as low as possible. We would then select the model and the set of parameters that yield the lowest perplexity.
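A rough sketch of the two building blocks of this procedure is given below: a random 80/20 split and a perplexity computation. The function names are hypothetical, and the perplexity function assumes that a document-topic row is already available for each validation document, which in practice requires an inference step that is not shown here.

```python
import math
import random


def train_validation_split(documents, validation_ratio=0.2, seed=0):
    """Shuffle the documents and hold out a fraction of them for validation."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    num_validation = int(len(shuffled) * validation_ratio)
    return shuffled[num_validation:], shuffled[:num_validation]


def perplexity(bow_docs, doc_topic, topic_word):
    """exp(-average log-likelihood per word), lower is better."""
    log_likelihood = 0.0
    num_words = 0
    for d, document in enumerate(bow_docs):
        for word_id, count in document:
            # p(word | document) = sum over topics of B[d][t] * C[t][word]
            p = sum(doc_topic[d][t] * topic_word[t][word_id]
                    for t in range(len(topic_word)))
            log_likelihood += count * math.log(p)
            num_words += count
    return math.exp(-log_likelihood / num_words)


toy_train, toy_validation = train_validation_split(range(10))

# With uniform distributions over 4 words, the model is as "surprised"
# as a fair 4-sided die, so the perplexity equals the vocabulary size.
uniform_B = [[0.5, 0.5]]
uniform_C = [[0.25] * 4, [0.25] * 4]
toy_perplexity = perplexity([[(0, 2), (3, 1)]], uniform_B, uniform_C)
```

A grid search would then call the training routine once per parameter set and keep the set with the lowest validation perplexity.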

However, we are conducting a quick analysis to uncover recurrent topics, and we don't want to classify new documents yet. So we don't need our model to be as accurate as possible yet.
For now, we only need to set a high number K of topics to discover.

In a further study, if we had to classify topics in previously unseen tweets, we should select the most accurate model using this procedure.

### 1.3.3 - Training a model and first results

We train a model with the following parameters:
    
    - 100 passes of the algorithm
    - 20 different topics to discover

In [8]:
from utils.topic_modeler import TopicModeler

In [9]:
model = TopicModeler((dictionary_word2id, document_term_matrix), 10)
model.train(100)

In [30]:
# Save
with open('../trained_model.pkl', 'wb') as save_file:
    pickle.dump(model, save_file)

In [3]:
# Load
with open('../trained_model.pkl', 'rb') as load_file:
    model = pickle.load(load_file)

We then display the most common topics and their most common keywords:

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
model.display_results()
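The internals of `display_results` are not shown in this notebook. As a hedged illustration of what such a display boils down to, the sketch below extracts the highest-probability keywords of each topic from a topic-term matrix C. The function name `top_keywords`, the toy matrix and the toy id-to-word mapping are all assumptions.

```python
def top_keywords(topic_word, id2token, num_keywords=10):
    """Return, for each topic, its most probable (keyword, probability) pairs."""
    result = []
    for row in topic_word:
        # Word ids sorted by decreasing probability within the topic
        best_ids = sorted(range(len(row)), key=lambda w: row[w], reverse=True)
        result.append([(id2token[w], row[w]) for w in best_ids[:num_keywords]])
    return result


toy_topic_word = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
toy_id2token = {0: 'nutella', 1: 'brioche', 2: 'fromage'}

for topic_id, keywords in enumerate(top_keywords(toy_topic_word, toy_id2token, 2)):
    print(topic_id, keywords)
```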

# 2 - Interpretation of the results