## Analyzing CSA's Lessons Learned | Analyse des leçons apprises par l'ASC 
### Step 3: Topic Modelling | Étape 3 : Modélisation des sujets  
This notebook takes the results of step 2 (translated text, sentiment analysis scores) and applies latent Dirichlet allocation (LDA) for topic modelling. 
This workflow could be adapted to any other spreadsheet or CSV file. The only required input is a spreadsheet with a column of text in English, French, or both.  

Ce cahier reprend les résultats de l'étape 2 (texte traduit, analyse de sentiments) et y applique l'allocation de Dirichlet latente (LDA) pour faire la modélisation des sujets. 
Ce workflow pourrait être adapté à tout autre tableur ou csv. Tout ce qui est requis c'est un tableur qui contient une colonne de texte en anglais, français, ou les deux langues. 

Author/Auteur: N Fee, Canadian Space Agency/Agence spatiale canadienne, 2021-06-18 

In [1]:
import nltk
import pandas as pd 

#Libraries for topic modelling 
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import gensim 


import spacy #for lemmatization (i.e. reducing inflected forms of a word to a base form: "studying" and "studies" become "study")
import pyLDAvis.gensim_models # for visualizing the topics
import pyLDAvis

import operator #useful for dealing with topic tuples
import warnings #useful for disabling an annoying deprecation warning 

warnings.filterwarnings("ignore",category=DeprecationWarning) #otherwise one of the libraries keeps emitting deprecation warnings



In [2]:
infile = "2_Output/LessonsLearned_step2.xlsx"
outfile = "2_Output/LessonsLearned_step3.xlsx"

stopwords_en = nltk.corpus.stopwords.words("english") #stopwords are words removed before analysis because they are very common ('the', 'a', etc.). Requires a one-time nltk.download("stopwords")
stopwords_fr = nltk.corpus.stopwords.words("french") #same but in french
stopwords_bil = stopwords_en + stopwords_fr #same but including stopwords from english and french. This is for analysing the untranslated text

newStopWords_en = ['csa','project','projects','also', 'agency', 'space'] #New stopwords relevant for the CSA dataset
newStopWords_fr = ['asc','projet','projets', 'cette','agence', 'spatial']

#Add the new stopwords to the existing stopwords
stopwords_en.extend(newStopWords_en)
stopwords_fr.extend(newStopWords_fr)
stopwords_bil.extend(newStopWords_en+newStopWords_fr)
                     
#More user-friendly column names for the results. If you change the number of topics, you'll want to change this as well.                      
colnames = ['Topic 1', 'Topic 1 Probability',
            'Topic 2', 'Topic 2 Probability',
            'Topic 3', 'Topic 3 Probability',
            'Topic 4', 'Topic 4 Probability',
            'Topic 5', 'Topic 5 Probability',
            'Topic 6', 'Topic 6 Probability',
            'Top Topic', 'Top Topic Probability']
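
The column names above are hard-coded for six topics. As a minimal sketch, the same list could be generated from the number of topics so the two stay in sync. `make_colnames` is a hypothetical helper, not part of the original notebook.

```python
def make_colnames(ntopics):
    """Build result column names for `ntopics` topics plus the top-topic pair."""
    names = []
    for i in range(1, ntopics + 1):
        names.extend([f"Topic {i}", f"Topic {i} Probability"])
    names.extend(["Top Topic", "Top Topic Probability"])
    return names

# Six topics reproduces the hand-written list used in this notebook.
colnames_example = make_colnames(6)
print(colnames_example)
```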
    

In [3]:
#Tokenizes the text (ie. each word is an element in a list - this is necessary for analysis)
def sentence_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; punctuation is already dropped by the tokenizer
        
# Remove the stopwords from the tokenized text
def remove_stopwords(row,colname,stop_words):
    text = str(row[colname])
    text_clean = simple_preprocess(text)
    return [word for word in text_clean if word not in stop_words]

#Lemmatize the tokenized text (i.e. reduce inflected forms of a word to a base form: "studying" and "studies" become "study")
def lemmatization(texts, nlp_lang, allowed_postags=('NOUN','VERB','ADJ','ADV')):
    text_ls = []
    for sent in texts:
        doc = nlp_lang(" ".join(sent)) 
        text_ls.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return text_ls

#Create a corpus and dictionary based on the lemmatized text (very simply put, prep the text for the lda model). More info about this here: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#core-concepts-corpus (external site in english only)   
def lemmatized_to_corpus(data_lemmatized):
    id2word = corpora.Dictionary(data_lemmatized)
    texts = data_lemmatized
    corpus = [id2word.doc2bow(text) for text in texts]
    return corpus, id2word

#Build the latent dirichlet allocation (LDA) model. This is being used for topic modelling. 
def build_lda(corpus,id2word,ntopics=10):
    lda_model = gensim.models.ldamodel.LdaModel(corpus= corpus,
                                                   id2word=id2word,
                                                   num_topics=ntopics, 
                                                   random_state=100,
                                                   update_every=1,
                                                   chunksize=100,
                                                   passes=100,
                                                   alpha='auto',
                                                   per_word_topics=True)
    return lda_model

#Compute the coherence score. 
def compute_coherence(ldamodel,data_lemmatized,id2word):
    coherence_model_lda = CoherenceModel(model=ldamodel, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    return coherence_lda

#Print gensim's per-word likelihood bound and the coherence score
def print_perplexity_coherence(ldamodel,corpus, title, data_lemmatized, id2word):
    coherence_lda = compute_coherence(ldamodel,data_lemmatized, id2word)
    print(title)
    # log_perplexity returns the per-word likelihood bound, not the perplexity itself.
    # Perplexity is 2**(-bound), so a bound closer to zero (less negative) is better.
    print('Perplexity:      %6.3f'%ldamodel.log_perplexity(corpus))
    print('Coherence Score: %6.3f \n'%coherence_lda)

# Based on a list of topics, determine the topic that the text probably falls under
def assign_topic(lda_model,row,corpus_col):
    topic_list = lda_model.get_document_topics(row[corpus_col], minimum_probability=0, minimum_phi_value=None, per_word_topics=False)    
    top_topic = max(topic_list, key=operator.itemgetter(1))
    topic_list.append(top_topic)
    
    topic_list = [(item[0],round(item[1],2)) for item in topic_list] #round the probabilities 
    topic_list  = [item for t in topic_list for item in t] #list of tuples to a flat list 
    return topic_list
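
To make the output shape of `assign_topic` concrete, here is a small sketch of the same flattening logic applied to a hand-written topic list instead of a trained model. `flatten_topics` and the example probabilities are illustrative assumptions, not values from the dataset.

```python
import operator

def flatten_topics(topic_list):
    """Append the most probable topic, round probabilities, and flatten to one row."""
    top_topic = max(topic_list, key=operator.itemgetter(1))
    rows = list(topic_list) + [top_topic]
    rows = [(topic_id, round(float(prob), 2)) for topic_id, prob in rows]
    return [item for pair in rows for item in pair]

# Hypothetical (topic id, probability) pairs for a three-topic model.
example = [(0, 0.8123), (1, 0.0011), (2, 0.1866)]
flat = flatten_topics(example)
print(flat)  # [0, 0.81, 1, 0.0, 2, 0.19, 0, 0.81]
```

Each document therefore yields two values per topic followed by the top topic and its probability, which is why `colnames` needs two entries per topic plus the final pair.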

In [4]:
df = pd.read_excel(infile)

In [5]:
#Remove stopwords
df['lesson_clean_en'] = df.apply(lambda row: list(remove_stopwords(row,'Lessons EN',stopwords_en)), axis=1)
df['lesson_clean_fr'] = df.apply(lambda row: list(remove_stopwords(row,'Lessons FR',stopwords_fr)), axis=1)
df['lesson_clean_bil'] = df.apply(lambda row: list(remove_stopwords(row,'Lesson Learned',stopwords_bil)), axis=1)
df['lesson_clean_bil'][:10]

0    [honoured, present, state, canadian, sector, r...
1    [efforts, soutien, innovation, collaboration, ...
2    [several, differing, programs, services, csin,...
3    [encourager, lancement, nouvelles, entreprises...
4    [research, development, expenditures, totalled...
5    [plan, proposes, commitment, fund, initial, th...
6    [réseau, innovation, canadien, risc, fournira,...
7    [better, reflect, current, best, practices, ma...
8    [plan, opérationnel, décrit, détails, personne...
9    [gouvernement, canada, appuie, depuis, longtem...
Name: lesson_clean_bil, dtype: object
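
The stopword lists above are plain Python lists, so every `word not in stop_words` check scans the whole list. On larger datasets, a set gives constant-time lookups. The sketch below assumes a tiny hypothetical stopword set and uses a regular expression as a simplified stand-in for `simple_preprocess`; it is not the notebook's own tokenizer.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
DEMO_STOPWORDS = frozenset(["the", "of", "le", "de", "projet", "asc"])

def remove_stopwords_set(text, stop_set, min_len=2):
    """Lowercase, split into alphabetic tokens, drop short tokens and stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stop_set]

print(remove_stopwords_set("Le projet de l'ASC supports innovation", DEMO_STOPWORDS))
# ['supports', 'innovation']
```

In the notebook itself, the equivalent change would be to pass `set(stopwords_en)` (and so on) to `remove_stopwords`.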

In [6]:
# Load the spaCy English and French models. The downloads below only need to be run once, from a shell:
#python -m spacy download fr_core_news_sm
#python -m spacy download en_core_web_sm


nlp_en = spacy.load("en_core_web_sm")
nlp_fr = spacy.load("fr_core_news_sm") 


# Do lemmatization keeping only noun, adj, vb, adv
df['lemmatized_en'] = lemmatization(df['lesson_clean_en'], nlp_en)
df['lemmatized_fr'] = lemmatization(df['lesson_clean_fr'], nlp_fr)

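# Lemmatize the bilingual text in two passes: the English model first, then the French model on the result.
# Each model can mangle words from the other language (see 'preser' and 'canader' in the output below).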
df['lemmatized_bil'] = lemmatization(df['lesson_clean_bil'], nlp_en)
df['lemmatized_bil'] = lemmatization(df['lemmatized_bil'], nlp_fr)

df['lemmatized_bil'][:10]


0    [honour, preser, state, canadian, report, cove...
1    [soutien, innovation, collaboration, manifeste...
2    [several, differ, program, service, csin, offe...
3    [encourager, lancement, nouveau, entreprise, s...
4    [development, expenditur, total, organization,...
5    [plan, propose, commitment, fund, initial, yea...
6    [réseau, innovation, canadien, risc, fournir, ...
7    [well, reflect, current, good, practic, matter...
8    [plan, opérationnel, décrire, détail, personne...
9    [gouvernement, canader, appuie, longtemp, init...
Name: lemmatized_bil, dtype: object
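
One possible alternative to the two-pass approach is to send each row to a single lemmatizer chosen by its language code. The sketch below is an assumption rather than the notebook's method: `lemmatize_by_language` is a hypothetical helper, and the two toy lemmatizers stand in for the spaCy models so that the example runs on its own.

```python
def lemmatize_by_language(token_lists, languages, lemmatizers):
    """Apply the lemmatizer registered for each row's language code.

    Rows whose language has no registered lemmatizer are passed through unchanged.
    """
    results = []
    for tokens, lang in zip(token_lists, languages):
        lemmatize = lemmatizers.get(lang)
        results.append(lemmatize(tokens) if lemmatize else list(tokens))
    return results

# Toy stand-ins for real lemmatizers (illustration only).
toy_lemmatizers = {
    "en": lambda tokens: [t[:-1] if t.endswith("s") else t for t in tokens],
    "fr": lambda tokens: list(tokens),
}
token_lists = [["programs", "services"], ["réseau", "innovation"]]
languages = ["en", "fr"]
print(lemmatize_by_language(token_lists, languages, toy_lemmatizers))
# [['program', 'service'], ['réseau', 'innovation']]
```

With the real data, the lemmatizers could wrap the existing function, for example `lambda toks: lemmatization([toks], nlp_en)[0]`, and the language codes could come from the `Language` column, assuming it holds `en` or `fr` for every row.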

In [8]:
# Create corpora based on the lemmatized text (ie. a vector where each word is described by an ID and by the frequency it appears in the text)
df['corpus_en'],id2word_en = lemmatized_to_corpus(df['lemmatized_en'])
df['corpus_fr'],id2word_fr = lemmatized_to_corpus(df['lemmatized_fr'])
df['corpus_bil'],id2word_bil = lemmatized_to_corpus(df['lemmatized_bil'])
df['corpus_bil'][:10]

0    [(0, 2), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1...
1    [(3, 1), (30, 1), (31, 1), (32, 1), (33, 1), (...
2    [(11, 1), (77, 1), (78, 1), (79, 1), (80, 1), ...
3    [(49, 1), (56, 1), (57, 1), (67, 2), (71, 1), ...
4    [(0, 1), (20, 3), (26, 1), (91, 1), (122, 1), ...
5    [(39, 1), (59, 1), (79, 1), (83, 1), (90, 1), ...
6    [(16, 1), (37, 1), (50, 1), (64, 1), (66, 2), ...
7    [(0, 1), (4, 1), (24, 1), (126, 1), (142, 1), ...
8    [(49, 1), (55, 1), (57, 1), (59, 2), (64, 2), ...
9    [(38, 1), (47, 1), (50, 1), (67, 2), (71, 1), ...
Name: corpus_bil, dtype: object
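
The `(id, count)` pairs above are a bag-of-words encoding. The following pure-Python sketch shows the idea on two tiny documents. It is a simplified stand-in for `corpora.Dictionary` and `doc2bow`: the helper names are hypothetical, and the order in which gensim assigns ids may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list to sorted (token id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [["plan", "fund", "plan"], ["fund", "sector"]]
vocab = build_vocab(docs)
print(vocab)                             # {'plan': 0, 'fund': 1, 'sector': 2}
print([to_bow(d, vocab) for d in docs])  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```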

In [9]:
# Build the topic models
lda_model_en = build_lda(df['corpus_en'], id2word_en, ntopics = 6)
lda_model_fr = build_lda(df['corpus_fr'], id2word_fr, ntopics = 6)
lda_model_bil = build_lda(df['corpus_bil'], id2word_bil, ntopics = 6)

# Evaluate the topic models. The value printed as "Perplexity" is gensim's per-word likelihood bound (closer to zero is better); for coherence, higher is better.
print_perplexity_coherence(lda_model_en,df['corpus_en'],'ENGLISH', df['lemmatized_en'], id2word_en)
print_perplexity_coherence(lda_model_fr,df['corpus_fr'],'FRENCH', df['lemmatized_fr'], id2word_fr)
print_perplexity_coherence(lda_model_bil,df['corpus_bil'],'BILINGUAL', df['lemmatized_bil'], id2word_bil)


ENGLISH
Perplexity:      -6.147
Coherence Score:  0.429 

FRENCH
Perplexity:      -6.233
Coherence Score:  0.455 

BILINGUAL
Perplexity:      -6.314
Coherence Score:  0.487 
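
The values labelled "Perplexity" above are what gensim's `log_perplexity` returns: a per-word likelihood bound. gensim documents the perplexity itself as `2^(-bound)`. A minimal sketch of that conversion, applied to the three bounds printed above (`perplexity_from_bound` is a hypothetical helper):

```python
def perplexity_from_bound(per_word_bound):
    """Convert gensim's per-word likelihood bound into a perplexity value."""
    return 2 ** (-per_word_bound)

# Bounds copied from the output above.
bounds = [("ENGLISH", -6.147), ("FRENCH", -6.233), ("BILINGUAL", -6.314)]
for label, bound in bounds:
    print(f"{label}: {perplexity_from_bound(bound):.1f}")
```

A bound closer to zero gives a lower perplexity. Because the three models are trained on different vocabularies, these numbers are probably best read as rough indicators rather than a strict ranking.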



In [10]:
print('ENGLISH Topics')
print(lda_model_en.print_topics())

print('\nFRENCH Topics')
print(lda_model_fr.print_topics())

print('\nBILINGUAL Topics')
print(lda_model_bil.print_topics())


ENGLISH Topics
[(0, '0.029*"year" + 0.025*"increase" + 0.020*"new" + 0.019*"canadian" + 0.019*"innovation" + 0.019*"support" + 0.015*"revenue" + 0.015*"growth" + 0.015*"provide" + 0.015*"network"'), (1, '0.039*"export" + 0.027*"europe" + 0.027*"state" + 0.014*"market" + 0.014*"increase" + 0.014*"large" + 0.014*"continue" + 0.014*"organization" + 0.014*"year" + 0.014*"row"'), (2, '0.028*"sector" + 0.026*"organization" + 0.017*"year" + 0.016*"expenditure" + 0.016*"activity" + 0.016*"research" + 0.016*"canada" + 0.012*"support" + 0.012*"canadian" + 0.011*"patent"'), (3, '0.020*"service" + 0.015*"activity" + 0.015*"study" + 0.015*"hqp" + 0.015*"research" + 0.015*"sector" + 0.015*"canadian" + 0.015*"member" + 0.015*"program" + 0.010*"report"'), (4, '0.033*"sector" + 0.025*"year" + 0.017*"business" + 0.017*"access" + 0.017*"program" + 0.017*"facilitate" + 0.017*"report" + 0.009*"generally" + 0.009*"order" + 0.009*"measure"'), (5, '0.018*"sector" + 0.018*"canadian" + 0.012*"datum" + 0.012*"ri
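
All three models here use six topics. One common way to check that choice is to train a model for several topic counts and keep the one with the highest coherence. The sketch below shows only the selection loop. `best_num_topics` is a hypothetical helper, and the stand-in build and score functions are toys so that the example runs without gensim.

```python
def best_num_topics(candidates, build_fn, score_fn):
    """Build one model per candidate topic count and return the best-scoring count."""
    scores = {}
    for k in candidates:
        model = build_fn(k)
        scores[k] = score_fn(model)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy stand-ins: the "model" is just k, and the score peaks at k = 6.
best_k, scores = best_num_topics(
    candidates=range(2, 11, 2),
    build_fn=lambda k: k,
    score_fn=lambda model: -(model - 6) ** 2,
)
print(best_k, scores)
```

In this notebook, `build_fn` could be `lambda k: build_lda(df['corpus_en'], id2word_en, ntopics=k)` and `score_fn` could be `lambda m: compute_coherence(m, df['lemmatized_en'], id2word_en)`. With `passes=100` each build is slow, so a short candidate list is advisable.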

In [11]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis_en = pyLDAvis.gensim_models.prepare(lda_model_en, df['corpus_en'], id2word_en, sort_topics = False)
pyLDAvis.save_html(vis_en, '2_Output/LessonsLearned_topics_EN.html')

vis_fr = pyLDAvis.gensim_models.prepare(lda_model_fr, df['corpus_fr'], id2word_fr,sort_topics = False)
pyLDAvis.save_html(vis_fr, '2_Output/LessonsLearned_topics_FR.html')

vis_bil = pyLDAvis.gensim_models.prepare(lda_model_bil, df['corpus_bil'], id2word_bil,sort_topics = False)
pyLDAvis.save_html(vis_bil, '2_Output/LessonsLearned_topics_BIL.html')

In [12]:
# Assign topics to each text based on the trained model
topics = pd.DataFrame(df.apply(lambda row: assign_topic(lda_model_en,row,'corpus_en'), axis=1), columns = ['topics'])

# Useful for formatting 
topics_df=  pd.DataFrame(topics['topics'].to_list(), columns=colnames)

topics_df[:10]



Unnamed: 0,Topic 1,Topic 1 Probability,Topic 2,Topic 2 Probability,Topic 3,Topic 3 Probability,Topic 4,Topic 4 Probability,Topic 5,Topic 5 Probability,Topic 6,Topic 6 Probability,Top Topic,Top Topic Probability
0,0,0.0,1,0.0,2,1.0,3,0.0,4,0.0,5,0.0,2,1.0
1,0,0.81,1,0.0,2,0.19,3,0.0,4,0.0,5,0.0,0,0.81
2,0,0.0,1,0.0,2,0.0,3,1.0,4,0.0,5,0.0,3,1.0
3,0,0.0,1,0.0,2,0.0,3,0.0,4,1.0,5,0.0,4,1.0
4,0,0.0,1,0.0,2,1.0,3,0.0,4,0.0,5,0.0,2,1.0
5,0,0.0,1,0.0,2,1.0,3,0.0,4,0.0,5,0.0,2,1.0
6,0,1.0,1,0.0,2,0.0,3,0.0,4,0.0,5,0.0,0,1.0
7,0,1.0,1,0.0,2,0.0,3,0.0,4,0.0,5,0.0,0,1.0
8,0,0.0,1,0.0,2,0.0,3,0.0,4,0.0,5,1.0,5,1.0
9,0,0.0,1,0.0,2,1.0,3,0.0,4,0.0,5,0.0,2,1.0
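
A quick way to see how documents spread across topics is to count the `Top Topic` values. The sketch below uses only the ten rows displayed above, copied by hand, so it says nothing about the rest of the dataset.

```python
from collections import Counter

# 'Top Topic' values for rows 0-9, copied from the table above.
top_topics = [2, 0, 3, 4, 2, 2, 0, 0, 5, 2]

counts = Counter(top_topics)
for topic_id, n in sorted(counts.items()):
    print(f"topic {topic_id}: {n} document(s)")
```

On the full dataframe, `topics_df['Top Topic'].value_counts()` would give the same kind of summary.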


In [13]:
# Create a new dataframe that merges the initial dataframe with the new topics dataframe
df1 = pd.concat([df,topics_df], axis =1)
df1[:10]

Unnamed: 0.1,Unnamed: 0,Lesson Learned,Language,Lessons EN,Lessons FR,Negative Sentiment,Neutral Sentiment,Positive Sentiment,Compound Sentiment Score,lesson_clean_en,...,Topic 3,Topic 3 Probability,Topic 4,Topic 4 Probability,Topic 5,Topic 5 Probability,Topic 6,Topic 6 Probability,Top Topic,Top Topic Probability
0,0,I am honoured to present the State of the Cana...,en,I am honoured to present the State of the Cana...,J'ai l'honneur de présenter le rapport sur l'é...,0.0,0.932,0.068,0.6369,"[honoured, present, state, canadian, sector, r...",...,2,1.0,3,0.0,4,0.0,5,0.0,2,1.0
1,1,Les efforts de soutien à l’innovation et de co...,fr,Efforts to support innovation and collaboratio...,Les efforts de soutien à l’innovation et de co...,0.013,0.822,0.164,0.9184,"[efforts, support, innovation, collaboration, ...",...,2,0.19,3,0.0,4,0.0,5,0.0,0,0.81
2,2,There are several differing programs and servi...,en,There are several differing programs and servi...,Il existe plusieurs programmes et services dif...,0.0,0.95,0.05,0.5095,"[several, differing, programs, services, csin,...",...,2,0.0,3,1.0,4,0.0,5,0.0,3,1.0
3,3,Encourager le lancement de nouvelles entrepris...,fr,Encourage the launch of new businesses in the ...,Encourager le lancement de nouvelles entrepris...,0.0,0.86,0.14,0.8225,"[encourage, launch, new, businesses, sector, d...",...,2,0.0,3,0.0,4,1.0,5,0.0,4,1.0
4,4,Research and development (R&D) expenditures to...,en,Research and development (R&D) expenditures to...,Les dépenses de recherche et développement (R&...,0.0,0.922,0.078,0.6705,"[research, development, expenditures, totalled...",...,2,1.0,3,0.0,4,0.0,5,0.0,2,1.0
5,5,The plan proposes a commitment to fund the ini...,en,The plan proposes a commitment to fund the ini...,Le plan propose un engagement à financer les t...,0.0,0.858,0.142,0.7003,"[plan, proposes, commitment, fund, initial, th...",...,2,1.0,3,0.0,4,0.0,5,0.0,2,1.0
6,6,Le Réseau d’innovation spatial canadien (RISC)...,fr,The Canadian Space Innovation Network (RISC) w...,Le Réseau d’innovation spatial canadien (RISC)...,0.032,0.698,0.27,0.9072,"[canadian, innovation, network, risc, provide,...",...,2,0.0,3,0.0,4,0.0,5,0.0,0,1.0
7,7,To better reflect the current best practices a...,en,To better reflect the current best practices a...,Afin de mieux refléter les meilleures pratique...,0.0,0.82,0.18,0.8934,"[better, reflect, current, best, practices, ma...",...,2,0.0,3,0.0,4,0.0,5,0.0,0,1.0
8,8,Le plan opérationnel décrit les détails sur le...,fr,The operational plan describes the details of ...,Le plan opérationnel décrit les détails sur le...,0.0,0.977,0.023,0.3818,"[operational, plan, describes, details, person...",...,2,0.0,3,0.0,4,0.0,5,1.0,5,1.0
9,9,Le gouvernement du Canada appuie depuis longte...,fr,The Government of Canada has a long history of...,Le gouvernement du Canada appuie depuis longte...,0.04,0.734,0.226,0.8934,"[government, canada, long, history, supporting...",...,2,1.0,3,0.0,4,0.0,5,0.0,2,1.0


In [14]:
# Save the results
df1.to_excel(outfile, index=False)