# Topic Modeling with The Lord of the Rings Trilogy

## Imports

In [22]:
from pathlib import Path
import regex as re

## Loading Data

Let's load the text of the first volume, *The Fellowship of the Ring*.

In [17]:
DATA_PATH = Path("./data/01_fellowship.txt")

In [65]:
def read_data(file_path: Path) -> str:
    ''' Reads a UTF-8 text file and returns its contents as a string '''
    with open(file_path, "r", encoding="utf-8") as f_in:
        data = f_in.read()

    return data

In [72]:
data = read_data(DATA_PATH)

## Preprocessing Data

Let's view the first 2000 characters of our text

In [73]:
data[:2000]

"_Chapter 1_\n            A Long-expected Party\n\n     When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton.\n     Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his remarkable disappearance and unexpected return. The riches he had brought back from his travels had now become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at fifty. At ninety-nine they began to call him _well_-preserved, but _unchanged_ would have been nearer the mark. There were some that shook their heads and thought this was too much of a 

We need to split the book into a list of chapters. Looking through the text, each chapter starts with a heading of the form `_Chapter N_`, where `N` is the chapter number. We will use a regular expression (via the `regex` package imported above) to match these headings: `_Chapter\s\d+_`

In [74]:
# create a regex filter for splitting chapters
# (raw string, so \s and \d reach the regex engine unchanged)
chapter_filter = r"_Chapter\s\d+_"

In [115]:
# split the book into a list of chapters 
chapters = re.split(chapter_filter, data)
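
To see what the split produces, here is a minimal sketch on a made-up two-chapter sample (the `sample` string is hypothetical, not taken from the book file). It uses the standard-library `re` module, which should behave the same as `regex` for this simple pattern. Because the sample starts with a heading, the first element of the result is an empty string.

```python
import re

# Hypothetical miniature "book" with the same heading style as the real text
sample = (
    "_Chapter 1_\n  First Title\n\n  Body one.\n"
    "_Chapter 2_\n  Second Title\n\n  Body two.\n"
)

# Same pattern as chapter_filter, written as a raw string
parts = re.split(r"_Chapter\s\d+_", sample)

# The text begins with a heading, so parts[0] is the empty string
print(len(parts))      # 3
print(repr(parts[0]))  # ''
```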

Let's check that the chapters were split correctly by counting the elements and viewing the first 40 characters of each one.

In [124]:
# get the number of chapters extracted
len(chapters)

23

There are 23 elements in the list, but as the output below shows, Element 0 is empty: the text begins with a chapter heading, so the split leaves nothing before the first match. The first chapter starts at Element 1.

In [125]:
for no, chapter in enumerate(chapters):
    print(f"Element {no}:", chapter[:40])

Element 0: 
Element 1: 
            A Long-expected Party

    
Element 2: 
            The Shadow of the Past

   
Element 3: 
            Three is Company

     'You
Element 4: 
            A Short Cut to Mushrooms

 
Element 5: 
            A Conspiracy Unmasked

    
Element 6: 
            The Old Forest

     Frodo 
Element 7: 
            In the House of Tom Bombadi
Element 8: 
            Fog on the Barrow-Downs

  
Element 9: 
            At the Sign of
 The Prancin
Element 10: 
            Strider

     Frodo, Pippin
Element 11: 
            A Knife in the Dark

     A
Element 12: 
            Flight to the Ford

     Wh
Element 13: 
            Many Meetings

     Frodo w
Element 14: 
            The Council of Elrond

    
Element 15: 
            The Ring Goes South

     L
Element 16: 
            A Journey in the Dark

    
Element 17: 
            The Bridge of Khazad-dûm

 
Element 18: 
            Lothlórien

     'Alas! I F
Element 19: 
            The Mirror of Gala

The last chapter of *The Fellowship of the Ring* is "The Breaking of the Fellowship". It is Chapter 10 of Book II, and Book I has 12 chapters, so the volume contains 22 chapters in total. Together with the empty Element 0, that accounts for the 23 elements in the list.
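
Since the empty leading element carries no text, it may be worth dropping it before any further processing. A minimal sketch, assuming a hypothetical helper `drop_empty` (not part of the original notebook) and a made-up `example` list standing in for `chapters`:

```python
from typing import List


def drop_empty(parts: List[str]) -> List[str]:
    """Strip surrounding whitespace and discard elements that are empty afterwards."""
    return [part.strip() for part in parts if part.strip()]


# Made-up stand-in for the output of re.split on the book text
example = [
    "",
    "\n            A Long-expected Party\n\n     Body text ...",
    "\n            The Shadow of the Past\n\n     Body text ...",
]
cleaned = drop_empty(example)
print(len(cleaned))  # 2
```

Applied to the real list, `chapters = drop_empty(chapters)` would be expected to leave 22 elements, one per chapter.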

## Create Embeddings

In [126]:
from sentence_transformers import SentenceTransformer

In [127]:
model = SentenceTransformer('all-mpnet-base-v2')


In [128]:
embeddings = model.encode(chapters, show_progress_bar=True)

    Found GPU0 Quadro K5200 which is of cuda capability 3.5.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is 3.7.
    
Batches:   0%|          | 0/1 [00:00<?, ?it/s]


RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
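
The error comes from the environment rather than the code: the warning above says the Quadro K5200 has CUDA capability 3.5, below the 3.7 minimum that the installed PyTorch build supports. One possible workaround is to run the model on the CPU. The sketch below is an assumption about how this could be handled, not the notebook's actual fix. `choose_device` and `load_model` are hypothetical helpers, and the `(3, 7)` threshold is taken from the warning message. `torch.cuda.is_available`, `torch.cuda.get_device_capability`, and the `device` argument of `SentenceTransformer` are existing APIs.

```python
from typing import Optional, Tuple

# Minimum CUDA capability quoted in the warning message above
MIN_CAPABILITY = (3, 7)


def choose_device(capability: Optional[Tuple[int, int]]) -> str:
    """Return "cuda" if the GPU capability meets the minimum, otherwise "cpu".

    `capability` is None when no CUDA device is available.
    """
    if capability is not None and capability >= MIN_CAPABILITY:
        return "cuda"
    return "cpu"


def load_model(name: str = "all-mpnet-base-v2"):
    """Load the sentence-transformers model on a device the GPU can actually run."""
    # Imported here so choose_device can be used without these packages installed
    import torch
    from sentence_transformers import SentenceTransformer

    capability = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
    return SentenceTransformer(name, device=choose_device(capability))


print(choose_device((3, 5)))  # cpu
```

With this in place, `model = load_model()` followed by `embeddings = model.encode(chapters, show_progress_bar=True)` should run on the CPU for this GPU, though more slowly than on a supported GPU.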