# Goal of the TP

In this TP, we use two topic modelling methods.
The goal of topic modelling is to describe a set of documents in terms of a set of topics (unknown a priori) such that each document is more or less strongly correlated with each topic and that each topic is more or less strongly correlated with each element of the vocabulary ("form", or "word").
More specifically, we use here the scikit-learn library to train and then analyse a Latent Semantic Analysis (LSA) model and a Latent Dirichlet Allocation (LDA) model.

We use a "real" dataset consisting of the English subtitles of the *Game of Thrones* TV series, but also an artificial dataset that is particularly well suited to topic modelling.
This artificial dataset is useful to test the code and to get an idea of the results of the different methods in an ideal case.

In [None]:
use_controlled_dataset = True
#use_controlled_dataset = False

# Creation of a controlled dataset

In order to more easily test the topic modelling methods that we are about to implement, we here create an artificial dataset consisting of "texts" that we know correspond to a mixture of well-defined "topics".
A topic modelling algorithm should be able to recognise the topics in these documents fairly well.

The topics are coded as the integers from 0 to `n_topics - 1` (i.e. 15), or, equivalently, the letters from "A" (for 0) to "P" (for 15).
Each topic is associated with a small set of forms, all obtained by repeating the letter corresponding to the topic between 1 and 8 times (e.g. "B" or "BBB" for topic "B").
Each of the `n_docs=40` documents is assigned a topic `main_topic` and consists of `doc_length=2000` tokens. These tokens are not necessarily occurrences of a form associated with `main_topic`: each is an occurrence of a form associated with a topic selected randomly, with a probability that decreases with the distance between that topic and `main_topic`.

For instance, the eighth document (named "doc_C_07") is assigned topic "C"; it is therefore likely to contain a lot of tokens in "C", somewhat fewer tokens in "B" or "D", fewer still in "A" or "E", etc.
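
To get a feel for how quickly this probability decreases, the sampling rule can be simulated on its own. The sketch below is only an illustration: it uses the standard `random` module instead of NumPy, and the names `sample_offset`, `circular_distance` and `n_samples` are introduced here and are not part of the notebook.

```python
import random
from collections import Counter

random.seed(0)  # Fixed seed so that the illustration is reproducible.

n_topics = 16
n_samples = 100_000  # Hypothetical sample size, chosen for this illustration.

def sample_offset():
    # Same rule as in the generation code: a Gaussian draw scaled by 1.5,
    # truncated towards zero by int().
    return int(random.gauss(0, 1) * 1.5)

def circular_distance(offset):
    # Distance between two topics on the circular alphabet of `n_topics` letters.
    d = abs(offset) % n_topics
    return min(d, n_topics - d)

counts = Counter(circular_distance(sample_offset()) for _ in range(n_samples))
for distance in sorted(counts):
    print(f"distance {distance}: {counts[distance] / n_samples:.3f}")
```

The printed proportions should decrease as the distance grows, which is what makes neighbouring letters appear together in a document.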

The dataset is encoded as a list `dataset` that contains each document as a dictionary.
This dictionary contains:
*   an identifier (*str_id*) of the form "doc_x_yy" where "x" is the letter of its main topic and "yy" is a unique numerical identifier for the document;
*   a string (*raw_text*), the content of the document.

In [None]:
import numpy as np
import random

if(use_controlled_dataset):
  n_docs = 40 # Number of documents.
  doc_length = 2000 # Length of each document.
  n_topics = 16 # Number of topics.

  # Returns a character.
  # topic: int
  def topic_to_letter(topic):
    return chr(65 + topic) # 65 corresponds to "A".

  dataset = []
  for i in range(n_docs): # For each document.
    main_topic = int(i  * n_topics / n_docs) # Main topic of the document.
    print(f"{topic_to_letter(main_topic)} ({main_topic})", end=", ", flush=True)

    tokens = [] # list[str]
    for j in range(doc_length): # For each token.
      # Selects a topic for the token, likely to be `main_topic`, or one very near.
      topic = main_topic + int(np.random.randn() * 1.5)
      while(topic < 0): topic += n_topics
      topic = topic % n_topics

      # Generates the token.
      #token_length = np.floor(1 + 8 * np.random.uniform()) # Between 1 and 8.
      token_length = np.floor(9 - 8 * np.power(np.random.uniform(), 2)) # Between 1 and 8; more likely to be large than small.
      token = topic_to_letter(topic) * int(token_length)

      tokens.append(token)

    dataset.append({"str_id": f"doc_{topic_to_letter(main_topic)}_{i:02}", "raw_text": " ".join(tokens)})
else:
  print("[controlled dataset not in use]")

In [None]:
if(use_controlled_dataset):
  print(dataset[-1]["str_id"]) # Name of the last artificial document.
else:
  print("[controlled dataset not in use]")

In [None]:
if(use_controlled_dataset):
  print(dataset[7]["str_id"]) # Name of the 8th artificial document.
  print(dataset[7]["raw_text"]) # Text of the 8th artificial document. It should be composed mainly of tokens in C, then tokens in B or D, then tokens in A or E, etc.
else:
  print("[controlled dataset not in use]")

# Getting the Game of Thrones dataset

## Downloading and extracting the dataset

In [None]:
import os
import urllib.request # To download files (`import urllib` alone does not guarantee that `urllib.request` is loaded).
import zipfile # To unzip files.

zip_url = "https://moodle.u-paris.fr/mod/resource/view.php?id=781987"
zip_filename = "data.zip"
data_dirname = zip_filename.split(".")[0] # Name of the directory in which the dataset is/will be.

if(not use_controlled_dataset):
  if(os.path.isdir(data_dirname)):
    print("Dataset found.")
  else:
    # Downloads the dataset.
    tmp = urllib.request.urlretrieve(zip_url)
    filename = tmp[0]
    print(f"Dataset downloaded to '{filename}'.")

    # Extracts the dataset.
    with zipfile.ZipFile(filename, 'r') as zip_ref:
      zip_ref.extractall(".")
    assert os.path.isdir(data_dirname)
    print(f"Dataset extracted to '{data_dirname}'.")
else:
  print("[controlled dataset in use]")

In [None]:
if(not use_controlled_dataset):
  print(dataset[-1]["str_id"]) # Name of the last document.
else:
  print("[controlled dataset in use]")

## Reading the Game of Thrones dataset

The dataset contains one SRT file per episode of the *Game of Thrones* series, grouped by season (one subdirectory per season).

(The syntax of SRT files is described here: https://docs.fileformat.com/video/srt/)
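
As a reminder of what the reading code has to deal with, an SRT file is a sequence of blocks separated by blank lines, each made of a counter, a timing line and one or more lines of text. The sketch below parses a small, made-up excerpt held in a string; `sample_srt` and `srt_text_lines` are illustrative names, and the function is a simplified alternative to a file-based reader, not a replacement for it.

```python
import re

# Made-up excerpt in the SRT layout: counter, timing, text lines, blank line.
sample_srt = """1
00:00:01,000 --> 00:00:03,000
First line of dialogue.

2
00:00:04,000 --> 00:00:06,500
Second subtitle,
on two lines.
"""

def srt_text_lines(content):
    lines = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        block_lines = block.splitlines()
        # Skips the counter (1st line) and the timing (2nd line) of the block.
        lines.extend(line.strip() for line in block_lines[2:] if line.strip())
    return lines

print(" ".join(srt_text_lines(sample_srt)))
```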

As for the artificial dataset, this dataset is encoded as a list `dataset` that contains each document as a dictionary.
This dictionary contains:
*   an identifier (*str_id*) of the form "SxxEyy" where "xx" is the season number and "yy" is the episode number;
*   a string (*raw_text*), the concatenation of the subtitles of the corresponding episode.

In [None]:
if(not use_controlled_dataset):
  dataset = [] # This will be a list of dictionaries.

  # file_path: str
  # Returns a string.
  def read_srt_file(file_path):
      lines = []
      with open(file_path) as f:
          c = True
          while(c):
              s = f.readline() # This should be the end of the file, an empty line or a number.
              if(s == ""): c = False # The end of the file has been reached.
              if(s.strip() == ""): continue # End of the file or empty line.

              f.readline() # We can throw away the timing that follows.

              # All the next non-empty lines are character lines.
              s = f.readline().strip()
              while(s != ""):
                  if(not s.startswith("#")): lines.append(s) # Ignores lines that start with '#'.
                  s = f.readline().strip()

      return ' '.join(lines)

  for path, dirs, files in os.walk(data_dirname): # Iterates through every subdirectory.
      path_parts = path.split(os.path.sep)
      if(len(path_parts) == 1): # If we are in the root directory.
        continue # There is no subtitle file in the root directory, only some 'info.txt' file.
      season = path_parts[1] # 'Sxx' where 'xx' is the season number.
      #print(season)

      for file in files:
          file_parts = file.replace(' ', '.').split(".") # Most files have a name that starts with 'Game.of.Thrones.SxxEyy.' where 'xx' is the season number and 'yy' the episode number, but some filenames use ' ' instead of '.'.
          str_id = file_parts[3] # 'SxxEyy' where 'yy' is the episode number.
          print(str_id, end= ", ")

          file_path = os.path.join(path, file)
          #print(file_path)

          dataset.append({"str_id": str_id, "raw_text": read_srt_file(file_path)})
else:
  print("[controlled dataset in use]")

In [None]:
if(not use_controlled_dataset):
  dataset = sorted(dataset, key=(lambda x: x["str_id"])) # Sorts the episodes chronologically (via their season and episode number).

  for episode in dataset: print(episode["str_id"], end=", ")
else:
  print("[controlled dataset in use]")

In [None]:
if(not use_controlled_dataset):
  print(f'{dataset[0]["str_id"]}:')
  print(dataset[0]["raw_text"][:200]) # Beginning of the first episode.
  print("[…]")
  print(dataset[0]["raw_text"][-200:]) # End of the first episode.
else:
  print("[controlled dataset in use]")

# Preprocessing

In this section, the dictionary for each document is enriched, most notably with *processed_tokens*, a list of tokens that will later be converted into a bag-of-words vector.

This list is obtained from the raw text through a few steps: (i) normalisation/simplification of the text, (ii) tokenisation, (iii) filtering of stop words, i.e. words that are mostly irrelevant to the problem. For topic modelling, stop words are typically determiners and coordinating or subordinating conjunctions, but others are possible.

This block of code defines, if necessary, the set of stop words (`stopwords`) used afterwards.

In [None]:
if(use_controlled_dataset):
  filter_stopwords = False
else:
  filter_stopwords = True
  #filter_stopwords = False

language = "english"
#language = "french"

import nltk

if(filter_stopwords):
  try:
      print(f"NLTK stop words: {nltk.corpus.stopwords.words(language)}") # This might fail if "stopwords" is missing.
  except LookupError: # Raised by NLTK when the resource has not been downloaded yet.
      nltk.download('stopwords')
      print(f"NLTK stop words: {nltk.corpus.stopwords.words(language)}")

  stopwords = set()
  stopwords.update(set(nltk.corpus.stopwords.words(language)))

  # Additional stop words.
  if(language == "english"): stopwords.update({})
  elif(language == "french"): stopwords.update({"a", "si", "plus", "fait", "faire", "ça", "tout", "tous", "toute", "toutes", "ce", "celui", "ceux", "celle", "celles", "son", "sa", "ses", "leur", "leurs", "tu", "dit", "oui", "non", "si", "alors", "ne", "être", "avoir", "faut", "veux", "i", "ici", "là", "où", "quand", "veut", "peut", "il", "ils", "elle", "elles", "mais", "ou", "et", "donc", "car"})

  print(f"Stop words used: {stopwords}")

This block of code makes sure that the tokeniser (`nltk.word_tokenize`) is available.

In [None]:
try:
    print(nltk.word_tokenize("NLTK tokeniser ready.")) # This might fail if "punkt" is missing.
except LookupError: # Raised by NLTK when the resource has not been downloaded yet.
    nltk.download('punkt') # Necessary to use nltk.word_tokenize.
    print(nltk.word_tokenize("NLTK tokeniser ready."))

This block of code defines, if necessary, a stemmer (`stemmer`).

In [None]:
#stem_words = True
stem_words = False

if(stem_words):
    stemmer = nltk.stem.snowball.SnowballStemmer(language) # https://www.nltk.org/api/nltk.stem.snowball.html#nltk.stem.snowball.SnowballStemmer

    for w in ["running", "mangeront"]: # One English word and one French word.
      print(f"{w} -> {stemmer.stem(w)}")

This block of code defines the preprocessing function.
This is the function that implements the three steps mentioned above (normalisation/simplification, tokenisation, filtering).

When using the controlled dataset, only the tokenisation step is relevant.
You can implement the other two later.
When using the Game of Thrones dataset, the normalisation/simplification process should at least convert the text to lower case and remove punctuation marks and the HTML tags that subtitles sometimes contain.
Additional operations are possible, both in the normalisation/simplification step and after the tokenisation step.
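
To make the three steps concrete, here is one possible way of chaining them on a toy sentence. This is a minimal sketch under several assumptions: it uses a plain whitespace split instead of `nltk.word_tokenize` so that it runs without any NLTK resource, and `toy_preprocess` and `toy_stopwords` are hypothetical names with an invented stop word list.

```python
import re

# Hypothetical, tiny stop word list used only for this illustration.
toy_stopwords = {"how", "are", "you", "i", "m"}

def toy_preprocess(text, stopwords=frozenset()):
    # (i) Normalisation/simplification: removes tags, lower-cases the text,
    # replaces punctuation marks by spaces and collapses runs of whitespace.
    tmp = re.sub(r"<.*?>", "", text)
    tmp = tmp.lower()
    tmp = re.sub(r"[^\w\s]", " ", tmp)
    processed_text = re.sub(r"\s+", " ", tmp).strip()

    # (ii) Tokenisation: a simple whitespace split (not the NLTK tokeniser).
    tokens = processed_text.split()

    # (iii) Filtering: drops the stop words.
    tokens = [token for token in tokens if token not in stopwords]

    return (processed_text, tokens)

text, tokens = toy_preprocess("<i>Hello Jon Snow</i>, how are you?  I'm fine.", toy_stopwords)
print(text)
print(tokens)
```

Note that replacing the apostrophe by a space splits "I'm" into "i" and "m", which is why both appear in the toy stop word list; a real tokeniser may handle contractions differently.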

In [None]:
import re # To use regular expressions (regexes). https://docs.python.org/3/library/re.html https://docs.python.org/3/howto/regex.html

# Returns a pair composed of (i) a string and (ii) a list of tokens.
# text: str
def preprocess(text):
    # (i) Normalisation/simplification step
    tmp = text
    tmp = re.sub('<.*?>', '', tmp) # Removes anything that looks like an HTML tag.
    ## TODO
    processed_text = ()

    # (ii) Tokenisation step
    ## TODO
    tokens = ()

    # (iii) Filtering step
    ## TODO
    tokens = ()

    return (processed_text, tokens)

# Test
print(preprocess("Hello Jon Snow, how are you?  I'm fine thanks,   what about you ? Please count to 3. 1, 2, 3. Good."))

This block of code actually processes the documents using the function defined above.

In [None]:
for document in dataset:
    print(document["str_id"], end=", ", flush=True)

    (text, tokens) = preprocess(document["raw_text"])
    document["processed_text"] = text
    document["processed_tokens"] = tokens

In [None]:
print(f'{dataset[0]["str_id"]}:')
print(dataset[0]["processed_tokens"][:20])
print("[…]")
print(dataset[0]["processed_tokens"][-20:])

# Creation of the document-form matrix

The aim of this section is to create a document-form matrix `matrix` that indicates, for a selected set of forms (e.g. words), for each document, the number of occurrences of each of these forms in the document.

This is done in three steps: (i) counting the frequency of all forms in each document, (ii) creating a vocabulary that includes or excludes each form based on its document frequency (i.e. the proportion of documents in which it occurs at least once), the goal being to filter out forms that are too rare or too common, (iii) creating the document-form matrix restricted to the vocabulary thus determined.

## Counting of the frequency of all forms in each document

This block of code counts, (i) for each form, the number of documents in which it appears, and (ii) for each document, the number of occurrences (i.e. frequency) of each form.
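
The difference between these two kinds of counts can be seen on a toy example. The three short token lists below are invented for the illustration, and the `toy_` names are not used elsewhere in the notebook.

```python
from collections import Counter

# Invented token lists standing in for the `processed_tokens` of three documents.
toy_docs = [
    ["winter", "is", "coming", "winter"],
    ["winter", "is", "here"],
    ["the", "north", "is", "cold"],
]

toy_document_counts = Counter()  # For each form, the number of documents containing it.
toy_term_counts = []             # For each document, the frequency of each form.
for tokens in toy_docs:
    toy_term_counts.append(Counter(tokens))
    toy_document_counts.update(set(tokens))  # set(): each form counts once per document.

print(toy_term_counts[0]["winter"])   # Occurrences of "winter" in the first document.
print(toy_document_counts["winter"])  # Number of documents containing "winter".
```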

In [None]:
from collections import Counter # https://docs.python.org/3/library/collections.html#collections.Counter

document_form_counts = Counter() # For each form, the number of documents it occurs in.
for document in dataset:
    print(document["str_id"], end=", ", flush=True)

    document["counts"] = Counter(document["processed_tokens"]) # For each form, the number of its occurrences in the document.
    document_form_counts.update(set(document["processed_tokens"]))
print()

print(document_form_counts)

## Creation of a vocabulary

We here define the vocabulary used afterwards.
This vocabulary contains the forms with a document frequency above a given lower bound `min_df` and below a given upper bound `max_df`.
This vocabulary is encoded as a dictionary `form2id`, that associates an integer identifier to each form, and, conversely, a list `id2form`, that associates its form to each identifier.
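
As an illustration of this filtering, the sketch below builds the two structures from invented document counts. It is only one possible approach: the bounds are treated as inclusive here, which is a choice made for the example, and all `toy_` names and values are hypothetical.

```python
from collections import Counter

n_documents = 10  # Hypothetical corpus size.

# Invented document counts: form -> number of documents containing it.
toy_document_counts = Counter({"the": 10, "winter": 5, "dragon": 3, "direwolf": 1})

toy_min_df, toy_max_df = 0.2, 0.9

# Keeps the forms whose document frequency lies within the two bounds.
# Sorting makes the numbering of the forms reproducible.
toy_id2form = sorted(
    form
    for form, count in toy_document_counts.items()
    if toy_min_df <= count / n_documents <= toy_max_df
)
toy_form2id = {form: i for i, form in enumerate(toy_id2form)}

print(toy_id2form)
print(toy_form2id)
```

In this toy case, "the" is dropped for being too common and "direwolf" for being too rare.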

In [None]:
# Feel free to play with these values. Some probably yield better results than others, depending on the data.
max_df = 0.9 # Upper limit for the document frequency of a form.
min_df = 0.06 # Lower limit for the document frequency of a form.

# TODO
form2id = () # From form (str) to id (int).
id2form = () # From id (int) to form (str).


print(f"Number of forms: {len(id2form)}")

## Creation of the document-form matrix

We here define a document-form matrix as a bidimensional Numpy array.
Rows are indexed by documents, columns by forms.
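
The layout of such a matrix can be sketched on invented data as follows. The `toy_` names and counts are hypothetical; forms that are not in the vocabulary are simply skipped.

```python
from collections import Counter

import numpy as np

toy_id2form = ["dragon", "winter"]
toy_form2id = {form: i for i, form in enumerate(toy_id2form)}

# Invented per-document counts, including forms outside the vocabulary.
toy_counts = [
    Counter({"winter": 2, "is": 1}),
    Counter({"dragon": 1, "winter": 1, "the": 3}),
    Counter({"the": 1}),
]

# One row per document, one column per form of the vocabulary.
toy_matrix = np.zeros((len(toy_counts), len(toy_id2form)), dtype=np.int64)
for row, counts in enumerate(toy_counts):
    for form, count in counts.items():
        column = toy_form2id.get(form)
        if column is not None:  # Ignores forms that are not in the vocabulary.
            toy_matrix[row, column] = count

print(toy_matrix.shape)
print(toy_matrix)
```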

In [None]:
import numpy as np

# TODO
matrix = ()


print(matrix.shape)
print(matrix)

doc_lengths = matrix.sum(axis=1) # For each document, its length.
term_frequency = matrix.sum(axis=0) # For each form, its frequency (count).

# Latent Semantic Analysis (LSA)

## Learning a model

In [None]:
%%time

from sklearn.decomposition import TruncatedSVD

if(use_controlled_dataset): lsa_n_topics = n_topics
else: lsa_n_topics = 32 # Find a good value by trial and error.

assert lsa_n_topics <= len(dataset) # With LSA, there cannot be more topics than documents.
lsa_model = TruncatedSVD(n_components=lsa_n_topics, n_iter=10)

print("Fitting the model…", end="", flush=True)
lsa_model.fit(X=matrix)
print(" Done!")

## Analysis

In [None]:
topic_form_corr = lsa_model.components_ # Contains a description of each topic in terms of the forms (positive/negative values are interpreted as positive/negative correlations).
print(topic_form_corr.shape) # (#topics, #forms)
print(topic_form_corr)

In [None]:
document_topic_corr = lsa_model.transform(matrix) # Contains a description of each document in terms of the topics (positive/negative values are interpreted as positive/negative correlations).
print(document_topic_corr.shape) # (#documents, #topics)
print(document_topic_corr)

The goal here is to find, for each topic, the forms of max and min weight.

When using the controlled dataset (and a relevant value for `lsa_n_topics`), most topics should be strongly correlated with forms all using letters close to each other in the alphabet (note that during the generation of the dataset, we have considered the alphabet to be circular, in the sense that its end and its beginning are considered to be adjacent), and/or strongly anti-correlated with forms all using letters close to each other in the alphabet.
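
Finding the forms of extreme weight amounts to sorting the indices of a vector by value. The sketch below shows one way of doing so with `numpy.argsort` on an invented vector; `toy_topic_vector`, `toy_id2form` and `k` are hypothetical names used only here.

```python
import numpy as np

toy_id2form = ["AA", "BB", "CC", "DD", "EE"]
toy_topic_vector = np.array([0.6, 0.3, -0.1, -0.5, 0.05])  # Invented weights.
k = 2  # Number of forms to keep on each side.

order = np.argsort(toy_topic_vector)  # Indices sorted by increasing weight.
most_negative = order[:k]             # The k smallest weights, most negative first.
most_positive = order[::-1][:k]       # The k largest weights, most positive first.

print([(toy_id2form[i], float(toy_topic_vector[i])) for i in most_positive])
print([(toy_id2form[i], float(toy_topic_vector[i])) for i in most_negative])
```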

In [None]:
n_forms = 10 # For each topic, the `n_forms` forms most correlated with the topic and the `n_forms` forms most anti-correlated with it are displayed.
def lsa_show_topic(topic_id):
    print(f"Topic n°{topic_id}:")

    topic_vector = topic_form_corr[topic_id]
    #print(topic_vector) # From id (int) to score (float).

    # TODO
    positive_corr_id = ()
    negative_corr_id = ()

    print(f"positive correlation: {[(id2form[i], topic_vector[i]) for i in positive_corr_id]}")
    print(f"negative correlation: {[(id2form[i], topic_vector[i]) for i in negative_corr_id]}")

    print()

for topic_id in range(lsa_n_topics): lsa_show_topic(topic_id) # The first topics tend to be the most important.

# Latent Dirichlet Allocation (LDA)

## Learning a model

In [None]:
%%time

from sklearn.decomposition import LatentDirichletAllocation

if(use_controlled_dataset): lda_n_topics = n_topics
else: lda_n_topics = 32 # Find a good value by trial and error.

lda_model = LatentDirichletAllocation(n_components=lda_n_topics, max_iter=20, n_jobs=-1)

print("Fitting the model…", end="", flush=True)
lda_model.fit(X=matrix)
print(" Done!")

topic_term_dists = lda_model.components_ / np.expand_dims(lda_model.components_.sum(axis=1), axis=-1) # Contains, for each topic, the probability distribution of generation over forms.
doc_topic_dists = lda_model.transform(matrix) # Contains, for each document, the probability distribution over topics.

## Analysis

The goal here is to find, for each topic, the forms of max probability.

Some topics might be highly "spread out", in the sense that they have a more or less uniform distribution over the vocabulary.
These topics are not very informative and one can ignore them later.
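
One possible way of quantifying how "spread out" a topic is, which is an assumption of this sketch and not something the notebook prescribes, is the Shannon entropy of its distribution divided by the entropy of the uniform distribution. The function name `normalised_entropy` and the two distributions below are invented for the illustration.

```python
import math

def normalised_entropy(distribution):
    # Shannon entropy divided by its maximum possible value, log(n).
    # The result is 1.0 for a uniform distribution and close to 0.0
    # for a distribution concentrated on a single form.
    entropy = -sum(p * math.log(p) for p in distribution if p > 0)
    return entropy / math.log(len(distribution))

uniform_topic = [0.25, 0.25, 0.25, 0.25]  # Invented: maximally spread out.
peaked_topic = [0.97, 0.01, 0.01, 0.01]   # Invented: concentrated on one form.

print(normalised_entropy(uniform_topic))
print(normalised_entropy(peaked_topic))
```

A topic whose value is close to 1 would be a candidate for being ignored; the threshold to use is a matter of judgement.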

When using the controlled dataset (and a relevant value for `lda_n_topics`), most topics should generate forms all using letters close to each other in the alphabet.

In [None]:
n_forms = 20 # For each topic, the `n_forms` most probable forms are displayed.
def lda_show_topic(topic_id):
    print(f"Topic n°{topic_id}:")

    topic_vector = topic_term_dists[topic_id]
    #print(topic_vector) # From id (int) to score (float).

    # TODO
    significant_id = ()

    print([(id2form[i], topic_vector[i]) for i in significant_id])

    print()

for topic_id in range(lda_n_topics): lda_show_topic(topic_id)

For each document, we look at its main topics: the ones that are the most strongly associated with it (i.e. that are the most strongly involved in the generation of the document according to the model).
The set of all main topics (for all documents) is recorded in `major_topics`.

When using the controlled dataset (and a relevant value for `lda_n_topics`), the first main topic of each document should almost always match the letter in the name of the document (this can be checked by looking in the block just above or alternatively in the block below).

In [None]:
major_topics = set()

for (document_id, document_vector) in enumerate(doc_topic_dists):
    #print(f"Document n°{document_id}:")
    print(f"{dataset[document_id]['str_id']}:")

    #print(document_vector) # From id (int) to score (float).

    sorted_id = document_vector.argsort() # https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
    sorted_id = np.flip(sorted_id) # https://numpy.org/doc/stable/reference/generated/numpy.flip.html
    major_id = sorted_id[:n_forms]
    major_id = [i for i in major_id if(document_vector[i] > (0.1 * document_vector.max()))]
    #major_id = [i for i in major_id if(document_vector[i] > (1.1 * document_vector.min()))]

    print(major_id)
    print([document_vector[i] for i in major_id]) # Prints the probability within this document of each major topic.
    major_topics.update(major_id)

    print()

In [None]:
print(major_topics)
print()

for topic_id in major_topics: lda_show_topic(topic_id)

We here create a graph that shows the evolution of the topics in the documents of the dataset.

When using the controlled dataset (and a relevant value for `lda_n_topics`), each topic should clearly "peak" on a small region of documents and quickly disappear outside of it.

In [None]:
import matplotlib.pyplot as plt

#major_topics_only = True
major_topics_only = False

fig, ax = plt.subplots()
fig.set_size_inches(18.5, 10.5)

time = np.arange(len(dataset))
for i, topic_evolution in enumerate(np.transpose(lda_model.transform(matrix))):
    if(major_topics_only and (i not in major_topics)): continue # Filters out the topics that are not major ones.
    ax.plot(time, topic_evolution, label=f"topic {i}")

ax.legend(loc='lower right')
plt.show()

We use here the pyLDAvis library (https://pyldavis.readthedocs.io/en/latest/readme.html) to visualise the LDA model.

The interface mainly consists of a left panel, which spatially organises the topics, and a right panel, which shows the vocabulary distribution of one of them.
When a topic is selected, the right panel shows the forms that are the most "relevant" to this topic.
The notion of relevance used depends on a parameter λ that can be manually adjusted at the top of the panel and that is described at the bottom of the panel.
With λ=1, the most relevant forms are the ones that are the most frequently generated by the topic.
With λ=0, the most relevant forms are the ones that are the most specific to the topic.
(Remember that according to the LDA model, multiple topics may generate tokens of the same form.)
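
The effect of λ can be illustrated numerically. The sketch below assumes the relevance definition proposed by Sievert and Shirley for LDAvis, namely λ·log p(form | topic) + (1 − λ)·log(p(form | topic) / p(form)); the probabilities and the names `relevance`, `toy_forms` and `ranking` are invented for the example.

```python
import math

def relevance(p_form_given_topic, p_form, lam):
    # Weighted sum of the log-probability of the form under the topic
    # and of its log-lift (how much more likely it is under the topic
    # than in the corpus as a whole).
    return lam * math.log(p_form_given_topic) + (1 - lam) * math.log(p_form_given_topic / p_form)

# Invented probabilities: form -> (p(form | topic), p(form) in the whole corpus).
toy_forms = {
    "king": (0.20, 0.20),       # Frequent in the topic, but just as frequent everywhere.
    "direwolf": (0.05, 0.005),  # Rarer in the topic, but ten times rarer elsewhere.
}

def ranking(lam):
    # Forms sorted by decreasing relevance for the given value of λ.
    return sorted(toy_forms, key=lambda form: relevance(*toy_forms[form], lam), reverse=True)

print(ranking(1.0))  # Ranking by probability under the topic.
print(ranking(0.0))  # Ranking by specificity to the topic.
```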

Note that the numbering of the topics here is not necessarily the one used above.
This tool sorts the topics by decreasing order of importance in the dataset, the importance of a topic being the number of tokens it has generated in the corpus according to the LDA model.

When using the controlled dataset (and a relevant value for `lda_n_topics`), the topics should be of roughly equal size, rarely overlap, and consist of forms almost all using the same letter.

In [None]:
try:
    import pyLDAvis
except ImportError: # The library is not installed yet.
    !pip install pyLDAvis==3.3.1
    import pyLDAvis

vis_data = pyLDAvis.prepare(topic_term_dists=topic_term_dists, doc_topic_dists=doc_topic_dists, doc_lengths=doc_lengths, vocab=id2form, term_frequency=term_frequency, mds='mmds') # https://pyldavis.readthedocs.io/en/latest/modules/API.html#pyLDAvis.prepare
pyLDAvis.display(vis_data) # Warning: This tool numbers the topics according to their importance in the corpus (in term of number of tokens). It is possible to recognise a topic by looking at the list of forms displayed for λ=1. The list displayed for λ=0 shows the forms that are the most specific to the selected topic.