In [22]:
def print_on_line(list_words):
    """Return the words joined on a single line, each preceded by a space."""
    line = ''
    for word in list_words:
        line = line + ' ' + word
    return line

# Topic Modeling on social posts concerning the brand 'Nutella'

### Camille Strasser

In [1]:
from utils.data_cleaner import DataCleaner
from utils.numerical_corpus import NumericalCorpus
from utils.topic_modeler import TopicModeler

In [2]:
import pickle

In [3]:
DataCleaner

utils.data_cleaner.DataCleaner

In [5]:
text_corpus[0]

['on',
 'samedi',
 'soir',
 '22h',
 'pyjama',
 'lit',
 'télé',
 'train',
 'manger',
 'brioche',
 'nutella',
 'caille',
 'jakarta',
 'g',
 'nutella',
 'même',
 'prix',
 'g',
 'md',
 'boulette',
 'fromage',
 'boulette',
 'viande',
 'fourrer',
 'fromage',
 'muffin',
 'nutella',
 'obésité',
 'morbide',
 'guetter',
 'french',
 'fries',
 'um',
 'poter',
 'nutella',
 'juste',
 'nutella',
 'parfait',
 'person',
 'raising',
 'both',
 'hands',
 'in',
 'celebration',
 'emoji',
 'modifier',
 'fitzpatrick',
 'type-3',
 'chocolate',
 'bar',
 'chocolate',
 'bar',
 'arnaque',
 'siècle',
 'ca',
 'pouting',
 'face',
 'pouting',
 'face',
 'nutella',
 'gouter',
 'chocolat',
 'chaud',
 'nesquick',
 'tv',
 'montpellier',
 'sud',
 'oklm',
 'bready',
 'pourrir',
 'degueulasse',
 'faux',
 'kinder',
 'bueno',
 'gentilhommière',
 'inviter',
 'découvrir',
 'nouveau',
 'formule',
 'déjeuner',
 'février',
 'menu',
 'fevrier',
 'salade',
 'thaï',
 'calamars',
 'poires',
 'rôties',
 'coriandre',
 'fraiche',
 'wrap',



The goal of this analysis is to determine the most frequent topics appearing in French-language social media posts about the brand 'Nutella'. This analysis tells us which concepts French-speaking consumers associate with Nutella.

The mathematical approach to topic modeling is described in the first section of the analysis. For the business conclusions of the study, you can jump to the second section.

# 1 - Technical approach to topic modeling

We want to discover which topics and keywords are the most frequent in a raw text corpus. The most widely used algorithm for topic modeling, Latent Dirichlet Allocation (LDA), has been chosen for this quick analysis. We will go through the algorithm step by step:

    - Cleaning the text corpus
    - Finding a numerical representation
    - Performing LDA on the numerical representation

## 1.1 - Cleaning the raw text corpus

For this step of the algorithm, the DataCleaner class from the following module is used:

In [7]:
from utils.data_cleaner import DataCleaner

Let's take a look at the first line of the raw text corpus:

In [17]:
raw_corpus_path = '../nutella.csv'
raw_corpus = DataCleaner.get_raw_corpus_from_path(raw_corpus_path)

# First line of the corpus
example_raw_corpus = raw_corpus[0]
example_raw_corpus

'0|On est samedi soir il est 22h je suis en pyjama dans mon lit devant la télé en train de manger des brioches/Nutella et je me les cailles...|à jakarta le g de nutella il est au même prix que le g de md|Boulettes de fromage, boulettes de viande fourrées au fromage et muffin Nutella... L\'obésité morbide me guette 🍟|Um pote de Nutella...||@Blandine_Laff @MmmYummyFood juste du Nutella c\'est parfait 🙌🏼|🍫🍫 L\'arnaque du siècle ca....😡😡 #nutella #gouter #chocolat #chaud #nesquick #tv #montpellier #sud #oklm #bready #pourri #degueulasse #faux #kinder #bueno|"La Gentilhommière vous invite à découvrir la nouvelle formule Déjeuner de Février\n'

We can observe two issues in the raw corpus from this example:

    - There is internet-related content, such as emojis, that LDA cannot process as is
    - French is a highly inflected language, i.e. the same word can appear in different forms (singular/plural, conjugation)

### 1.1.1 - Dealing with internet-related content

Some internet-related items don't convey any meaning about the content, such as:

    - HTML tags (e.g. <em> some text </em>)
    - URLs
    - mentions of other users (e.g. @username)

They therefore have to be removed from the corpus.
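
As an illustration, this removal step could be sketched with regular expressions. The patterns and the function name below are assumptions made for this example, not the actual DataCleaner implementation. Each match is replaced by a space so that neighbouring words don't get glued together.

```python
import re

# Hypothetical patterns for this sketch; the real DataCleaner may differ.
HTML_TAG = re.compile(r'<[^>]+>')
URL = re.compile(r'https?://\S+|www\.\S+')
MENTION = re.compile(r'@\w+')


def remove_internet_noise(text):
    """Replace HTML tags, URLs and @mentions with a single space."""
    for pattern in (HTML_TAG, URL, MENTION):
        text = pattern.sub(' ', text)
    return text


cleaned = remove_internet_noise(
    "@Blandine_Laff juste du <em>Nutella</em> http://example.com")
print(cleaned.split())
```

These patterns are deliberately simple: the mention pattern, for instance, would also match the domain part of an e-mail address.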
    

Some other items, such as emojis and hashtags, convey a meaning related to the topic of a post. We want to keep them, but they have to be transformed.

I chose to:

    - Remove the # in hashtags but keep the content, for instance #chocolat -> chocolat
    - Replace emojis with their names, for instance 🍫 -> chocolate bar

Note that emoji names are in English. This won't cause much trouble in the later parts of the algorithm because it is 'language agnostic': it relies on word counts, not on the meaning of the words.
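
A minimal sketch of these two transformations, using only the standard library, could look like this. It assumes that the emoji names are the official Unicode character names, which matches the names shown in this notebook. The emoji detection by code point range is a simplification for the example, and the function names are hypothetical.

```python
import re
import unicodedata

HASHTAG = re.compile(r'#(\w+)')


def is_emoji(char):
    # Simplification for this sketch: only the main emoji blocks
    # (U+1F300 to U+1FAFF) are treated as emojis.
    return 0x1F300 <= ord(char) <= 0x1FAFF


def transform_hashtags_and_emojis(text):
    """Strip the '#' of hashtags and replace emojis by their Unicode name."""
    text = HASHTAG.sub(r'\1', text)
    pieces = []
    for char in text:
        if is_emoji(char):
            # Fall back to an empty name for characters unknown to unicodedata.
            name = unicodedata.name(char, '')
            pieces.append(' ' + name.lower() + ' ')
        else:
            pieces.append(char)
    return ''.join(pieces)


print(transform_hashtags_and_emojis('\U0001F36B #nutella #gouter #chocolat'))
```

Surrounding each emoji name with spaces keeps consecutive emojis from merging into a single token.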

As an example, we can see the results on the first line of the corpus:

In [24]:
example_after_internet_cleaning = DataCleaner('french').get_raw_textual_data_in_document(example_raw_corpus)
print_on_line(example_after_internet_cleaning)

' 0 On est samedi soir il est 22h je suis en pyjama dans mon lit devant la télé en train de manger des brioches/Nutella et je me les cailles... à jakarta le g de nutella il est au même prix que le g de md Boulettes de fromage, boulettes de viande fourrées au fromage et muffin Nutella... L\'obésité morbide me guette  french fries  Um pote de Nutella...      juste du Nutella c\'est parfait  person raising both hands in celebration  emoji modifier fitzpatrick type-3   chocolate bar  chocolate bar  L\'arnaque du siècle ca.... pouting face  pouting face   nutella  gouter  chocolat  chaud  nesquick  tv  montpellier  sud  oklm  bready  pourri  degueulasse  faux  kinder  bueno "La Gentilhommière vous invite à découvrir la nouvelle formule Déjeuner de Février'

### 1.1.2 - Dealing with the language itself

Some types of words don't convey any topic-related meaning, such as:

    - punctuation
    - pronouns
    - adverbs
    - prepositions ...

They thus need to be removed.


Remaining words, such as verbs and nouns, need to be standardized. This process of standardization is called 'lemmatization'. It consists of finding the form of the word that you would find in a dictionary. For instance, the lemma of a French verb is its infinitive form (e.g. veux -> vouloir).
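
To make the idea concrete, here is a toy sketch of both operations based on plain lookups. The stop word set and the lemma table are tiny, made-up samples for this example only. The actual cleaning relies on a statistical tagger, as discussed in the implementation note, not on hand-written tables.

```python
# Toy resources, made up for this sketch: a real pipeline would rely on a
# part-of-speech tagger and a full lemma dictionary.
STOP_WORDS = {'je', 'me', 'de', 'du', 'des', 'la', 'le', 'au', 'et', 'en'}
LEMMAS = {
    'veux': 'vouloir',
    'brioches': 'brioche',
    'boulettes': 'boulette',
    'fourrées': 'fourrer',
    'guette': 'guetter',
}


def lemmatize(words):
    """Drop punctuation and stop words, then map each word to its lemma."""
    lemmas = []
    for word in words:
        token = word.lower().strip('.,!?"\'')
        if not token or token in STOP_WORDS:
            continue
        # Words missing from the toy table are kept unchanged.
        lemmas.append(LEMMAS.get(token, token))
    return lemmas


print(lemmatize('Boulettes de viande fourrées au fromage'.split()))
```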

We can see the results of this cleaning on the first line of the corpus:

In [27]:
example_after_language_cleaning = DataCleaner('french').get_important_lemmas_in_textual_data(example_after_internet_cleaning)
print_on_line(example_after_language_cleaning)

' on samedi soir 22h pyjama lit télé train manger brioche nutella caille jakarta g nutella même prix g md boulette fromage boulette viande fourrer fromage muffin nutella obésité morbide guetter french fries um poter nutella juste nutella parfait person raising both hands in celebration emoji modifier fitzpatrick type-3 chocolate bar chocolate bar arnaque siècle ca pouting face pouting face nutella gouter chocolat chaud nesquick tv montpellier sud oklm bready pourrir degueulasse faux kinder bueno gentilhommière inviter découvrir nouveau formule déjeuner février'

### 1.1.3 - Cleaning the whole corpus

We first split the corpus into so-called 'documents' (each one is marked by a 'document number |' tag). Then we perform both steps described above to obtain a clean text corpus, split into documents.
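
The splitting step could be sketched as follows. It assumes that a document starts on a line beginning with a number followed by '|', and that any other line continues the previous document. Both the pattern and the function name are hypothetical; the actual logic lives in DataCleaner.

```python
import re

# Assumption for this sketch: a document starts with "<number>|".
DOCUMENT_START = re.compile(r'^\d+\|')


def split_into_documents(lines):
    """Group raw lines into documents, one per 'number|' tag."""
    documents = []
    for line in lines:
        if DOCUMENT_START.match(line) or not documents:
            # A new tag (or the very first line) opens a new document.
            documents.append(line)
        else:
            # Any other line continues the current document.
            documents[-1] += line
    return documents


raw_lines = ['0|premier post|"menu\n', 'salade thaï"\n', '1|second post\n']
print(len(split_into_documents(raw_lines)))
```

A post whose continuation line happens to start with a number and a '|' would be split wrongly by this sketch.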

In [4]:
data_path = '../nutella.csv'
text_corpus = DataCleaner('french').get_clean_documents_from_corpus_path(data_path)

Since this process can take a long time, you can save the clean corpus to a file and load it back later:

In [28]:
# Save
with open('../clean_text_corpus.pkl', 'wb') as save_file:
    pickle.dump(text_corpus, save_file)

In [30]:
# Load
with open('../clean_text_corpus.pkl', 'rb') as load_file:
    clean_text_corpus = pickle.load(load_file)
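
The save and load cells can also be combined into a small caching helper, so that the cleaning only runs when no saved corpus exists yet. This is a sketch: load_or_compute is a hypothetical helper, not part of the utils package.

```python
import os
import pickle


def load_or_compute(cache_path, compute):
    """Load the pickled object at cache_path, or compute and save it first."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as cache_file:
            return pickle.load(cache_file)
    result = compute()
    with open(cache_path, 'wb') as cache_file:
        pickle.dump(result, cache_file)
    return result


# Possible usage in this notebook (not run here):
# text_corpus = load_or_compute(
#     '../clean_text_corpus.pkl',
#     lambda: DataCleaner('french').get_clean_documents_from_corpus_path(data_path))
```

The cache file has to be deleted by hand whenever the raw corpus or the cleaning rules change, otherwise a stale corpus is loaded.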

### 1.1.4 - A note on implementation

The language normalization step described above is not perfect since it is based on a statistical model: the better the model, the better the normalization. Here, the models rely on fairly old algorithms, developed by the TreeTagger team in the 1990s.
More recent packages such as nltk couldn't be used because they only perform lemmatization for the English language.
One way to improve this language normalization step would be to train my own, more accurate models for the French language.