# LDA model training notebook

## **GENERAL OBJECTIVE**

**Train the LDA models on the 3 journals**

## **SPECIFIC OBJECTIVES**

- Prepare the document/term matrices used as input to the LDA models
- Train the LDA models while varying 3 hyperparameters: number of topics, bag-of-words vs. tf-idf, and lemmatized vs. non-lemmatized vocabulary
- Train 5 models with the same configuration for each of the 3 journals: reliability study
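
As a rough illustration of the hyperparameter grid described in the objectives, the sketch below enumerates every combination for one journal. The topic counts are the ones used later in this notebook; the names `topic_counts`, `representations`, `lemmatization` and `configs` are hypothetical and only serve the example.

```python
from itertools import product

# Topic counts used later in the notebook: 1..9, then 10..100 in steps of 10.
topic_counts = list(range(1, 10)) + list(range(10, 110, 10))
# Hypothetical labels for the two other hyperparameters.
representations = ["bow", "tfidf"]
lemmatization = [False, True]

# One configuration per combination of the three hyperparameters.
configs = [
    {"num_topics": k, "representation": rep, "lemmatized": lem}
    for k, rep, lem in product(topic_counts, representations, lemmatization)
]

print(len(configs), "configurations per journal")
```

Each journal therefore gets one trained model per entry of `configs`, in addition to the repeated runs of the reliability study.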

**Importing libraries and data**

In [1]:
import gensim
from gensim.models import LdaModel
from gensim import models

import random
from tqdm import tqdm



**AE imports**

In [2]:
# the first 2 variables come from the Preprocessing_v2 notebook

# list of tokens for each article
%store -r tokens_bigrams_Corpus_LDA_AE_clean
# the lemmatized tokens are needed below for the lemmatized AE split
%store -r tokens_bigrams_Corpus_LDA_AE_clean_lemma

# dictionary without lemmatization
%store -r dictionary_AE_2

# dictionary with lemmatization
%store -r dictionary_AE_2_lemma

**EI imports**

In [3]:
# the first 2 variables come from the Preprocessing_v2 notebook

# list of tokens for each article
%store -r tokens_bigrams_Corpus_LDA_EI_clean
%store -r tokens_bigrams_Corpus_LDA_EI_clean_lemma

# dictionary without lemmatization
%store -r dictionary_EI_2

# dictionary with lemmatization
%store -r dictionary_EI_2_lemma

**RI imports**

In [4]:
# the first 2 variables come from the Preprocessing_v2 notebook

# list of tokens for each article
%store -r tokens_bigrams_Corpus_LDA_RI_clean
%store -r tokens_bigrams_Corpus_LDA_RI_clean_lemma

# dictionary without lemmatization
%store -r dictionary_RI_2

# dictionary with lemmatization
%store -r dictionary_RI_2_lemma

## **I. Corpus vectorization and train/test split**

### AE JOURNAL

In [None]:
random.seed(10)

In [None]:
train_proportion = 0.8  # test_proportion = 1 - train_proportion
train_len = int(train_proportion * len(tokens_bigrams_Corpus_LDA_AE_clean))
print("Training set size:", train_len, "articles.")
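
The shuffle-then-slice split used in this section can be sketched as a small helper. This is a minimal sketch, not the notebook's own code: `split_train_test` and `toy_docs` are hypothetical names, and the helper uses a local `random.Random` generator and a copy of the list, whereas the cells below shuffle the stored lists in place with the global generator.

```python
import random

def split_train_test(docs, train_proportion=0.8, seed=10):
    """Shuffle a copy of `docs` and split it into (train, test) lists."""
    rng = random.Random(seed)   # local generator: the global random state is untouched
    shuffled = list(docs)       # copy, so the caller's list keeps its original order
    rng.shuffle(shuffled)
    train_len = int(train_proportion * len(shuffled))
    return shuffled[:train_len], shuffled[train_len:]

# Toy corpus of 10 single-token documents (illustrative data only).
toy_docs = [["token%d" % i] for i in range(10)]
train_docs, test_docs = split_train_test(toy_docs)
print(len(train_docs), len(test_docs))
```

Because the training-set size is computed inside the helper, it always matches the length of the corpus being split.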

**USAGE: create the corpora only once**

**WITHOUT LEMMATIZATION**

In [None]:
random.shuffle(tokens_bigrams_Corpus_LDA_AE_clean)

In [None]:
train_texts_AE = tokens_bigrams_Corpus_LDA_AE_clean[:train_len]
test_texts_AE = tokens_bigrams_Corpus_LDA_AE_clean[train_len:]

# Build the bag-of-words representation of the corpus
corpus_train_AE = [dictionary_AE_2.doc2bow(doc) for doc in train_texts_AE]
corpus_test_AE = [dictionary_AE_2.doc2bow(doc) for doc in test_texts_AE]

# tf-idf version (the weighting is fitted on the training corpus only)
tfidf = models.TfidfModel(corpus_train_AE)
corpus_train_AE_tfidf = tfidf[corpus_train_AE]
corpus_test_AE_tfidf = tfidf[corpus_test_AE]

In [None]:
%store train_texts_AE
%store test_texts_AE
%store corpus_train_AE
%store corpus_test_AE
%store corpus_train_AE_tfidf
%store corpus_test_AE_tfidf
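
To make the tf-idf re-weighting above more concrete, here is a pure-Python sketch of the computation. It assumes what is, to my understanding, the default gensim weighting (raw term frequency times log2(N / document frequency), followed by L2 normalization of each document); check the `TfidfModel` documentation before relying on it. `tfidf_weights` and `toy_bow` are hypothetical names. Unlike the cell above, the sketch fits and transforms the same toy corpus.

```python
import math

def tfidf_weights(bow_corpus):
    """Re-weight a bag-of-words corpus: tf * log2(N / df), then L2-normalize each document."""
    n_docs = len(bow_corpus)

    # Document frequency: number of documents in which each token id appears.
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _count in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _tid, w in weights))
        # Tokens present in every document have weight 0 and are dropped here.
        weighted_corpus.append([(tid, w / norm) for tid, w in weights if w > 0.0])
    return weighted_corpus

# Two toy documents as (token_id, count) pairs; token 0 occurs in both.
toy_bow = [[(0, 2), (1, 1)], [(0, 1), (2, 3)]]
toy_tfidf = tfidf_weights(toy_bow)
print(toy_tfidf)
```

In this toy corpus, token 0 appears in every document, so its weight is zero and only the document-specific tokens remain.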

**WITH LEMMATIZATION**

**NOTE**: the corpora with and without lemmatization are not shuffled in the same order
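
If the same train/test partition were wanted for both variants, one option would be to draw a single permutation of indices and apply it to both lists. The notebook as written does not do this; the sketch below is only an illustration, assuming both token lists contain the same articles in the same order. `shared_shuffle` and the toy data are hypothetical.

```python
import random

def shared_shuffle(docs_raw, docs_lemma, seed=10):
    """Apply one random permutation to two parallel lists of documents."""
    if len(docs_raw) != len(docs_lemma):
        raise ValueError("both corpora must contain the same number of documents")
    order = list(range(len(docs_raw)))
    random.Random(seed).shuffle(order)   # a single permutation, reused for both lists
    return [docs_raw[i] for i in order], [docs_lemma[i] for i in order]

# Toy parallel corpora: same articles, raw vs. lemmatized tokens (illustrative data only).
raw = [["cats"], ["running"], ["mice"]]
lemma = [["cat"], ["run"], ["mouse"]]
raw_shuffled, lemma_shuffled = shared_shuffle(raw, lemma)
print(list(zip(raw_shuffled, lemma_shuffled)))
```

With aligned shuffles, document `i` of the raw training set and document `i` of the lemmatized training set refer to the same article.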

In [None]:
random.shuffle(tokens_bigrams_Corpus_LDA_AE_clean_lemma)

In [None]:
train_texts_AE_lemma = tokens_bigrams_Corpus_LDA_AE_clean_lemma[:train_len]
test_texts_AE_lemma = tokens_bigrams_Corpus_LDA_AE_clean_lemma[train_len:]

# Build the bag-of-words representation of the corpus
corpus_train_AE_lemma = [dictionary_AE_2_lemma.doc2bow(doc) for doc in train_texts_AE_lemma]
corpus_test_AE_lemma = [dictionary_AE_2_lemma.doc2bow(doc) for doc in test_texts_AE_lemma]

# tf-idf version (the weighting is fitted on the training corpus only)
tfidf = models.TfidfModel(corpus_train_AE_lemma)
corpus_train_AE_tfidf_lemma = tfidf[corpus_train_AE_lemma]
corpus_test_AE_tfidf_lemma = tfidf[corpus_test_AE_lemma]

In [None]:
%store train_texts_AE_lemma
%store test_texts_AE_lemma
%store corpus_train_AE_lemma
%store corpus_test_AE_lemma
%store corpus_train_AE_tfidf_lemma
%store corpus_test_AE_tfidf_lemma

### EI JOURNAL

**WITHOUT LEMMATIZATION**

In [None]:
random.shuffle(tokens_bigrams_Corpus_LDA_EI_clean)

In [None]:
# Recompute the training-set size from the EI corpus (train_len was derived from the AE corpus)
train_len_EI = int(train_proportion * len(tokens_bigrams_Corpus_LDA_EI_clean))
train_texts_EI = tokens_bigrams_Corpus_LDA_EI_clean[:train_len_EI]
test_texts_EI = tokens_bigrams_Corpus_LDA_EI_clean[train_len_EI:]

# Build the bag-of-words representation of the corpus
corpus_train_EI = [dictionary_EI_2.doc2bow(doc) for doc in train_texts_EI]
corpus_test_EI = [dictionary_EI_2.doc2bow(doc) for doc in test_texts_EI]

# tf-idf version (the weighting is fitted on the training corpus only)
tfidf = models.TfidfModel(corpus_train_EI)
corpus_train_EI_tfidf = tfidf[corpus_train_EI]
corpus_test_EI_tfidf = tfidf[corpus_test_EI]

In [None]:
%store train_texts_EI
%store test_texts_EI
%store corpus_train_EI
%store corpus_test_EI
%store corpus_train_EI_tfidf
%store corpus_test_EI_tfidf

**WITH LEMMATIZATION**

**NOTE**: the two corpora are not shuffled in the same order!


In [None]:
random.shuffle(tokens_bigrams_Corpus_LDA_EI_clean_lemma)

In [None]:
# Recompute the training-set size from the EI corpus (train_len was derived from the AE corpus)
train_len_EI = int(train_proportion * len(tokens_bigrams_Corpus_LDA_EI_clean_lemma))
train_texts_EI_lemma = tokens_bigrams_Corpus_LDA_EI_clean_lemma[:train_len_EI]
test_texts_EI_lemma = tokens_bigrams_Corpus_LDA_EI_clean_lemma[train_len_EI:]

# Build the bag-of-words representation of the corpus
corpus_train_EI_lemma = [dictionary_EI_2_lemma.doc2bow(doc) for doc in train_texts_EI_lemma]
corpus_test_EI_lemma = [dictionary_EI_2_lemma.doc2bow(doc) for doc in test_texts_EI_lemma]

# tf-idf version (the weighting is fitted on the training corpus only)
tfidf = models.TfidfModel(corpus_train_EI_lemma)
corpus_train_EI_tfidf_lemma = tfidf[corpus_train_EI_lemma]
corpus_test_EI_tfidf_lemma = tfidf[corpus_test_EI_lemma]

In [None]:
%store train_texts_EI_lemma
%store test_texts_EI_lemma
%store corpus_train_EI_lemma
%store corpus_test_EI_lemma
%store corpus_train_EI_tfidf_lemma
%store corpus_test_EI_tfidf_lemma

### RI JOURNAL

**WITHOUT LEMMATIZATION**

In [None]:
random.shuffle(tokens_bigrams_Corpus_LDA_RI_clean)

In [None]:
# Recompute the training-set size from the RI corpus (train_len was derived from the AE corpus)
train_len_RI = int(train_proportion * len(tokens_bigrams_Corpus_LDA_RI_clean))
train_texts_RI = tokens_bigrams_Corpus_LDA_RI_clean[:train_len_RI]
test_texts_RI = tokens_bigrams_Corpus_LDA_RI_clean[train_len_RI:]

# Build the bag-of-words representation of the corpus
corpus_train_RI = [dictionary_RI_2.doc2bow(doc) for doc in train_texts_RI]
corpus_test_RI = [dictionary_RI_2.doc2bow(doc) for doc in test_texts_RI]

# tf-idf version (the weighting is fitted on the training corpus only)
tfidf = models.TfidfModel(corpus_train_RI)
corpus_train_RI_tfidf = tfidf[corpus_train_RI]
corpus_test_RI_tfidf = tfidf[corpus_test_RI]

In [None]:
%store train_texts_RI
%store test_texts_RI
%store corpus_train_RI
%store corpus_test_RI
%store corpus_train_RI_tfidf
%store corpus_test_RI_tfidf

**WITH LEMMATIZATION**

**NOTE**: the two corpora are not shuffled in the same order!


In [None]:
random.shuffle(tokens_bigrams_Corpus_LDA_RI_clean_lemma)

In [None]:
# Recompute the training-set size from the RI corpus (train_len was derived from the AE corpus)
train_len_RI = int(train_proportion * len(tokens_bigrams_Corpus_LDA_RI_clean_lemma))
train_texts_RI_lemma = tokens_bigrams_Corpus_LDA_RI_clean_lemma[:train_len_RI]
test_texts_RI_lemma = tokens_bigrams_Corpus_LDA_RI_clean_lemma[train_len_RI:]

# Build the bag-of-words representation of the corpus
corpus_train_RI_lemma = [dictionary_RI_2_lemma.doc2bow(doc) for doc in train_texts_RI_lemma]
corpus_test_RI_lemma = [dictionary_RI_2_lemma.doc2bow(doc) for doc in test_texts_RI_lemma]

# tf-idf version (the weighting is fitted on the training corpus only)
tfidf = models.TfidfModel(corpus_train_RI_lemma)
corpus_train_RI_tfidf_lemma = tfidf[corpus_train_RI_lemma]
corpus_test_RI_tfidf_lemma = tfidf[corpus_test_RI_lemma]

In [None]:
%store train_texts_RI_lemma
%store test_texts_RI_lemma
%store corpus_train_RI_lemma
%store corpus_test_RI_lemma
%store corpus_train_RI_tfidf_lemma
%store corpus_test_RI_tfidf_lemma

## **II. Model training**

In [5]:
def train_model_LDA(corpus, dictionary, revue='AE', alpha='auto', num_topic=5, chunksize=500, passes=20, iterations=400, eval_every=None):
    """Train a gensim LdaModel on `corpus`.

    Model hyperparameters:
    - num_topic = number of topics
    - chunksize = number of documents processed in each training chunk
    - passes = number of training epochs (passes over the corpus)
    - iterations = maximum number of variational Bayes (VB) iterations when inferring the topic distribution of the documents
    - eval_every = how often perplexity is evaluated during training (costly); None disables it
    Note: `revue` (journal label) is currently unused.
    """

    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token

    return LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize,
                    alpha=alpha, eta='auto',
                    iterations=iterations, num_topics=num_topic,
                    passes=passes, eval_every=eval_every)

**Hyperparameters**

In [6]:
# 1 to 9 topics, then 10 to 100 topics in steps of 10
num_topics = list(range(1, 10)) + list(range(10, 110, 10))
chunksize=500
passes=20
iterations=400
eval_every= None
save = True       
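
The save paths used in the training loops below all follow the same pattern (journal folder, then `lda_<journal>[_tfidf][_lemma]_<num_topic>`). Building them by hand makes it easy to drop a separator, so a small helper can keep them consistent. This is a sketch only: `model_path` is a hypothetical function that is not part of the notebook, and the default root folder is taken from the paths used below.

```python
def model_path(journal, num_topic, tfidf=False, lemma=False, root="Résultats_LDA"):
    """Build a save path such as 'Résultats_LDA/AE/lda_ae_tfidf_lemma_10'."""
    parts = ["lda", journal.lower()]
    if tfidf:
        parts.append("tfidf")
    if lemma:
        parts.append("lemma")
    parts.append(str(num_topic))
    # Joining with "_" guarantees a separator before the topic count.
    return f"{root}/{journal}/" + "_".join(parts)

print(model_path("AE", 10, tfidf=True, lemma=True))
print(model_path("RI", 5))
```

A call such as `model.save(model_path("EI", num_topic, tfidf=True))` could then replace the string concatenation in each loop.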

### AE JOURNAL

**BAG-OF-WORDS MODEL**

**WITHOUT LEMMATIZATION**

In [7]:
%store -r corpus_train_AE
dictionary = dictionary_AE_2
temp = dictionary[0] 
id2word = dictionary.id2token

In [9]:
for num_topic in tqdm(num_topics):
    model = train_model_LDA(corpus=corpus_train_AE, dictionary = dictionary_AE_2, num_topic= num_topic, eval_every = None)
    model.save('Résultats_LDA/AE/lda_ae_'+ str(num_topic))

  diff = np.log(self.expElogbeta)
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [09:10<00:00, 550.14s/it]


**WITH LEMMATIZATION**

In [None]:
%store -r corpus_train_AE_lemma
dictionary = dictionary_AE_2_lemma
temp = dictionary[0] 
id2word = dictionary.id2token

In [None]:
for num_topic in tqdm(num_topics):
    model = train_model_LDA(corpus=corpus_train_AE_lemma, dictionary = dictionary_AE_2_lemma, num_topic= num_topic, eval_every = None)
    model.save('Résultats_LDA/AE/lda_ae_lemma_'+ str(num_topic))

**TF-IDF MODEL**

**WITHOUT LEMMATIZATION**

In [None]:
# load the tf-idf training corpus (not the bag-of-words one)
%store -r corpus_train_AE_tfidf
dictionary = dictionary_AE_2
temp = dictionary[0]
id2word = dictionary.id2token

In [None]:
for num_topic in tqdm(num_topics):
    model = train_model_LDA(corpus=corpus_train_AE_tfidf, dictionary = dictionary_AE_2, num_topic= num_topic, eval_every = None)
    model.save('Résultats_LDA/AE/lda_ae_tfidf_'+ str(num_topic))

**WITH LEMMATIZATION**

In [None]:
# load the lemmatized tf-idf training corpus (not the bag-of-words one)
%store -r corpus_train_AE_tfidf_lemma
dictionary = dictionary_AE_2_lemma
temp = dictionary[0]
id2word = dictionary.id2token

In [None]:
for num_topic in tqdm(num_topics):
    model = train_model_LDA(corpus=corpus_train_AE_tfidf_lemma, dictionary = dictionary_AE_2_lemma, num_topic= num_topic, eval_every = None)
    model.save('Résultats_LDA/AE/lda_ae_tfidf_lemma_'+ str(num_topic))

**RELIABILITY STUDY**

In [5]:
%store -r corpus_train_AE
dictionary = dictionary_AE_2
temp = dictionary[0] 
id2word = dictionary.id2token

In [6]:
# Train 5 models with 10 topics and 5 models with 40 topics
for num_topic in [10,40]:
    for iteration in tqdm(range(5)):
        model = train_model_LDA(corpus=corpus_train_AE, dictionary = dictionary_AE_2, num_topic= num_topic, eval_every = None)
        #model.save('Résultats_LDA/Fiabilité/lda_ae_'+ str(num_topic) +'_' + str(iteration))
        model.save('Résultats_LDA/lda_ae_'+ str(num_topic) +'_' + str(iteration))

100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [13:11<00:00, 157.89s/it]
  diff = np.log(self.expElogbeta)
100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [25:14<00:00, 302.64s/it]


### EI JOURNAL

**BAG-OF-WORDS MODEL**

**WITHOUT LEMMATIZATION**

In [10]:
# load the training corpus
%store -r corpus_train_EI
dictionary = dictionary_EI_2  # without lemmatization
temp = dictionary[0] 
id2word = dictionary.id2token

In [11]:
for num_topic in tqdm(num_topics):
    model = train_model_LDA(corpus=corpus_train_EI, dictionary = dictionary_EI_2, num_topic= num_topic, eval_every = None)
    model.save('Résultats_LDA/EI/lda_ei_'+ str(num_topic))

100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [10:06<00:00, 606.18s/it]


**WITH LEMMATIZATION**

In [None]:
# load the training corpus
%store -r corpus_train_EI_lemma
dictionary = dictionary_EI_2_lemma
temp = dictionary[0]
id2word = dictionary.id2token

In [None]:
for num_topic in tqdm(num_topics):
    # the lemmatized corpus must be paired with the lemmatized dictionary
    model = train_model_LDA(corpus=corpus_train_EI_lemma, dictionary = dictionary_EI_2_lemma, num_topic= num_topic, eval_every = None)
    model.save('Résultats_LDA/EI/lda_ei_lemma_'+ str(num_topic))

**TF-IDF MODEL**

**WITHOUT LEMMATIZATION**

In [None]:
# load the training corpus
%store -r corpus_train_EI_tfidf
dictionary = dictionary_EI_2  # without lemmatization
temp = dictionary[0]
id2word = dictionary.id2token

In [None]:
for num_topic in tqdm(num_topics):
    model = train_model_LDA(corpus=corpus_train_EI_tfidf, dictionary = dictionary_EI_2, num_topic= num_topic, eval_every = None)
    # trailing underscore added so the file name matches the other configurations
    model.save('Résultats_LDA/EI/lda_ei_tfidf_'+ str(num_topic))

**WITH LEMMATIZATION**

In [None]:
# load the training corpus
%store -r corpus_train_EI_tfidf_lemma
dictionary = dictionary_EI_2_lemma
temp = dictionary[0]
id2word = dictionary.id2token

In [None]:
for num_topic in tqdm(num_topics):
    # the lemmatized corpus must be paired with the lemmatized dictionary
    model = train_model_LDA(corpus=corpus_train_EI_tfidf_lemma, dictionary = dictionary_EI_2_lemma, num_topic= num_topic, eval_every = None)
    model.save('Résultats_LDA/EI/lda_ei_tfidf_lemma_'+ str(num_topic))

**RELIABILITY STUDY**

In [None]:
# Train 5 models with 10 topics and 5 models with 40 topics
for num_topic in [10,40]:
    for iteration in tqdm(range(5)):
        model = train_model_LDA(corpus=corpus_train_EI, dictionary = dictionary_EI_2, num_topic= num_topic, eval_every = None)
        model.save('Résultats_LDA/Fiabilité/lda_ei_'+ str(num_topic) +'_' + str(iteration))

### RI JOURNAL

**BAG-OF-WORDS MODEL**

**WITHOUT LEMMATIZATION**

In [12]:
# load the training corpus
%store -r corpus_train_RI
dictionary = dictionary_RI_2  # without lemmatization
temp = dictionary[0] 
id2word = dictionary.id2token

In [13]:
for num_topic in tqdm(num_topics):
    model = train_model_LDA(corpus=corpus_train_RI, dictionary = dictionary_RI_2, num_topic= num_topic, eval_every = None)
    model.save('Résultats_LDA/RI/lda_ri_'+ str(num_topic))

100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [09:01<00:00, 541.11s/it]


**WITH LEMMATIZATION**

In [9]:
# load the training corpus
%store -r corpus_train_RI_lemma
dictionary = dictionary_RI_2_lemma
temp = dictionary[0]
id2word = dictionary.id2token

In [None]:
for num_topic in tqdm(num_topics):
    # the lemmatized corpus must be paired with the lemmatized dictionary
    model = train_model_LDA(corpus=corpus_train_RI_lemma, dictionary = dictionary_RI_2_lemma, num_topic= num_topic, eval_every = None)
    model.save('Résultats_LDA/RI/lda_ri_lemma_'+ str(num_topic))

**TF-IDF MODEL**

**WITHOUT LEMMATIZATION**

In [5]:
# load the training corpus
%store -r corpus_train_RI_tfidf
dictionary = dictionary_RI_2  # without lemmatization
temp = dictionary[0]
id2word = dictionary.id2token

In [6]:
for num_topic in tqdm(num_topics):
    model = train_model_LDA(corpus=corpus_train_RI_tfidf, dictionary = dictionary_RI_2, num_topic= num_topic, eval_every = None)
    # trailing underscore added so the file name matches the other configurations
    model.save('Résultats_LDA/RI/lda_ri_tfidf_'+ str(num_topic))

100%|███████████████████████████████████████████████████████████████████████████████████| 9/9 [25:25<00:00, 168.94s/it]


**WITH LEMMATIZATION**

In [7]:
# load the training corpus
%store -r corpus_train_RI_tfidf_lemma
dictionary = dictionary_RI_2_lemma
temp = dictionary[0]
id2word = dictionary.id2token

In [8]:
for num_topic in tqdm(num_topics):
    # the lemmatized corpus must be paired with the lemmatized dictionary
    model = train_model_LDA(corpus=corpus_train_RI_tfidf_lemma, dictionary = dictionary_RI_2_lemma, num_topic= num_topic, eval_every = None)
    # trailing underscore added so the file name matches the other configurations
    model.save('Résultats_LDA/RI/lda_ri_tfidf_lemma_'+ str(num_topic))

100%|███████████████████████████████████████████████████████████████████████████████████| 3/3 [07:40<00:00, 152.71s/it]


**RELIABILITY STUDY**

In [None]:
# Train 5 models with 10 topics and 5 models with 40 topics
for num_topic in [10,40]:
    for iteration in tqdm(range(5)):
        model = train_model_LDA(corpus=corpus_train_RI, dictionary = dictionary_RI_2, num_topic= num_topic, eval_every = None)
        model.save('Résultats_LDA/Fiabilité/lda_ri_'+ str(num_topic) +'_' + str(iteration))
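
This notebook only trains the repeated models; it does not say how their agreement is measured. One possible way to compare two runs, shown here purely as an assumption, is to match topics by the Jaccard overlap of their top words. The top words could be read from each saved model (for example with gensim's `show_topic`), but the sketch below uses hand-written toy topics so that it is self-contained. `jaccard`, `best_match_overlap`, `run_1` and `run_2` are hypothetical names.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def best_match_overlap(run_a, run_b):
    """For each topic of run_a, the highest Jaccard overlap with any topic of run_b."""
    return [max(jaccard(topic_a, topic_b) for topic_b in run_b) for topic_a in run_a]

# Toy top-word lists for two runs with 2 topics each (illustrative data only).
run_1 = [["market", "price", "trade"], ["labour", "wage", "union"]]
run_2 = [["wage", "labour", "strike"], ["price", "market", "trade"]]

scores = best_match_overlap(run_1, run_2)
print(scores)
```

Scores close to 1 for every topic would suggest that the two runs recovered similar topics, regardless of the order in which the topics appear.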