In [221]:
from collections import defaultdict
import os
from gensim import corpora
from gensim import similarities
from gensim import models
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/beelzebruno/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Latent Semantic Analysis with Star Wars movie scripts

This notebook is a small study in natural language processing: it applies the Latent Semantic Analysis (LSA) technique to build a simple text-based search engine.

I collected the full scripts of Star Wars episodes I to VI from https://imsdb.com/search.php, saved them as text files, and split them into two directories by trilogy: episodes I, II and III in a folder named `sw_prequels`, and the original trilogy (IV, V, VI) in a folder named `sw_original_trilogy`. The resulting structure looks like this:

```
.
├── lsa.ipynb
├── sw_original_trilogy
│   ├── IV_a_new_hope.txt
│   ├── VI_return_of_the_jedi.txt
│   └── V_the_empire_strikes_back.txt
└── sw_prequels
    ├── III_revenge_of_the_sith.txt
    ├── II_attack_of_the_clones.txt
    └── I_the_phantom_menace.txt

2 directories, 7 files
```

The goal is to fit the model on two documents: the prequel scripts from one directory and the original trilogy scripts from the other. When the model receives a text query, it compares the query against both documents and returns a similarity score for each, indicating which trilogy the text most likely comes from.

In [248]:
# Start by listing the current directory
os.listdir()

['sw_prequels', 'sw_original_trilogy', 'lsa.ipynb', '.ipynb_checkpoints']

In [249]:
# List the files inside the target directories
print(os.listdir('sw_prequels'))
print(os.listdir('sw_original_trilogy'))

['III_revenge_of_the_sith.txt', 'II_attack_of_the_clones.txt', '.ipynb_checkpoints', 'I_the_phantom_menace.txt']
['V_the_empire_strikes_back.txt', 'VI_return_of_the_jedi.txt', '.ipynb_checkpoints', 'IV_a_new_hope.txt']


In [37]:
# Read every file in each directory into a separate list of lines
prequels = []
trilogy = []

for file in os.listdir('sw_prequels'):
    if not file.endswith('.txt'): continue
    with open(f'sw_prequels/{file}', 'r') as f:
        prequels.extend(f.read().split('\n'))

for file in os.listdir('sw_original_trilogy'):
    if not file.endswith('.txt'): continue
    with open(f'sw_original_trilogy/{file}', 'r') as f:
        trilogy.extend(f.read().split('\n'))

# Remove the blank strings '' and tabs \t left from line splitting
prequels = [line.replace('\t', '').lower() for line in prequels]
prequels = list(filter(('').__ne__, prequels))

trilogy = [line.replace('\t', '').lower() for line in trilogy]
trilogy = list(filter(('').__ne__, trilogy))
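
The two loading loops above differ only in the directory name, so they could share a helper. A minimal sketch, assuming the same folder layout and UTF-8 encoded files; `load_script_lines` is a hypothetical name that does not appear in the original notebook:

```python
from pathlib import Path


def load_script_lines(directory):
    """Return the non-empty, lowercased, tab-free lines of every .txt file in `directory`."""
    lines = []
    # sorted() gives a deterministic file order; glob('*.txt') also skips
    # non-text entries such as .ipynb_checkpoints
    for path in sorted(Path(directory).glob('*.txt')):
        # Assumption: the script files are UTF-8 encoded
        for line in path.read_text(encoding='utf-8').split('\n'):
            line = line.replace('\t', '').lower()
            if line:
                lines.append(line)
    return lines
```

With this in place, the two lists would be built with `load_script_lines('sw_prequels')` and `load_script_lines('sw_original_trilogy')`.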

In [38]:
# Total lines of each sequence
print(len(prequels))
print(len(trilogy))

9159
13823


In [39]:
# Preview some sentences from the prequels
prequels[100:110]

['artoo squeals in a panic. on the view screen artoo\'s squeal reads out, "we\'re not going to make it."',
 'anakin: wrong thought, artoo.',
 'anakin slips through the narrow gap. the trailing vulture droid fighters crash.',
 "anakin: (continuing) i'm through.",
 'obi-wan continues to fire on the vulture droid fighters, driving them into the explosion.',
 'a clone fighter is hit and explodes, spewing debris. the clone pilot spins off into space.',
 'finally, obi-wan peels off and swings around, pulling up alongside anakin. clone fight squad seven battles the droids.',
 'odd ball: there are too many of them.',
 "clone pilot 2: i'm on your wing. break left. break left. they're all over me. get them off my . . .",
 "anakin: i'm going to go help them out!"]

In [40]:
# Preview some sentences from the OT
trilogy[100:110]

['solo?',
 'han',
 'no sign of life out there, general.',
 "the sensors are in place.  you'll ",
 'know if anything comes around.',
 'rieekan',
 'commander skywalker reported in yet?',
 'han',
 "no.  he's checking out a meteorite ",
 'that hit near him.']

In [215]:
# Transform all prequel movie scripts into one single document (a big string)
prequels_doc = ' '.join(prequels)
# Same for the original trilogy
trilogy_doc = ' '.join(trilogy)
# Put both documents in a list with two entries, one for each document
docs = [prequels_doc, trilogy_doc]
print(len(docs))

2


In [250]:
# Index 0 is the prequels document and index 1 is the original trilogy document
translate = {
    0: 'PREQUELS',
    1: 'ORIGINAL TRILOGY'
}

In [216]:
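# Removing the stop words
# stop_words is not defined anywhere above; it is assumed here to be
# the NLTK English stop word list downloaded in the first cell
stop_words = set(stopwords.words('english'))
txts = [[word for word in document.lower().split()
         if word not in stop_words]
         for document in docs]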
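
Splitting on whitespace keeps punctuation attached to words, so `artoo.` and `artoo` are counted as different tokens. A minimal sketch of a regex-based tokenizer, offered as an optional variation and not as what this notebook does (the results reported further down come from the whitespace split). The names `tokenize` and `demo_stop_words` are hypothetical, and the tiny stop word set is only a stand-in for the NLTK list:

```python
import re

# Hypothetical stand-in for the NLTK English stop word list
demo_stop_words = {'a', 'in', 'the', 'on', 'to'}


def tokenize(text, stop_words):
    """Lowercase, keep runs of letters, digits and apostrophes, then drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(tokenize('Artoo squeals in a panic.', demo_stop_words))
# ['artoo', 'squeals', 'panic']
```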

In [220]:
# Calculating frequency
frequency = defaultdict(int)

for text in txts:
    for token in text:
        frequency[token] += 1

# Remove tokens that appear only once in the whole corpus
txts = [[token for token in text if frequency[token] > 1] for text in txts]
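
Note that the frequency is counted across both documents together, so a token that appears once in each document is kept. A small sketch on made-up data, using `collections.Counter` as an equivalent way of counting; `toy_texts` and `toy_frequency` are hypothetical names used only for illustration:

```python
from collections import Counter

# Made-up token lists standing in for the two documents
toy_texts = [['jedi', 'droid', 'senate'], ['jedi', 'rebel', 'droid', 'droid']]

# Count every token across all documents at once
toy_frequency = Counter(token for text in toy_texts for token in text)

# Keep only tokens seen more than once in the whole corpus
filtered = [[token for token in text if toy_frequency[token] > 1]
            for text in toy_texts]
print(filtered)
# [['jedi', 'droid'], ['jedi', 'droid', 'droid']]
```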

In [222]:
# Use gensim to get a numeric vector representation of the texts
gensim_dictionary = corpora.Dictionary(txts)
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in txts]
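
To make the bag-of-words step less opaque, here is a pure-Python sketch of the idea: every distinct token gets an integer id, and a text becomes a list of `(token_id, count)` pairs in which unknown tokens are ignored. `build_vocab` and `to_bow` are hypothetical helpers written for illustration; they are not gensim functions, and the ids gensim assigns may differ:

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(text, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())


vocab = build_vocab([['jedi', 'droid'], ['jedi', 'droid', 'droid']])
print(to_bow(['droid', 'droid', 'sith'], vocab))
# [(1, 2)]
```

The token `sith` is dropped because it is not in the vocabulary. The same happens to query words that never occur in the scripts.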

In [223]:
# There are only two topics: the prequels and the OT
lsi = models.LsiModel(gensim_corpus, id2word=gensim_dictionary, num_topics=2)
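
For reference, the decomposition behind LSA can be written down as follows. This is the standard textbook formulation, given as background; gensim's implementation details, such as optional scaling by the singular values, may differ. Here $A$ is the term-document matrix with one column $a_j$ per document, $k$ corresponds to `num_topics`, $q$ is the bag-of-words vector of a query, and the hatted symbols are the same vectors projected into the $k$-dimensional topic space:

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{d}_j = U_k^{\top} a_j, \qquad
\hat{q} = U_k^{\top} q
```

With only two documents, $A$ has at most two non-zero singular values, so $k = 2$ keeps all of them.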

In [259]:
# Transform the corpus to LSI space and index it once, outside the function,
# so the index is not rebuilt on every query
index = similarities.MatrixSimilarity(lsi[gensim_corpus])


def get_result(query):
    # Create a bag of words from the query input
    vec_bow = gensim_dictionary.doc2bow(query.lower().split())
    vec_lsi = lsi[vec_bow]

    # Perform a similarity query against the corpus
    simil = index[vec_lsi]
    simil = sorted(enumerate(simil), key=lambda item: -item[1])

    # Map each document index to its label, most similar document first
    return {translate[doc_id]: score for doc_id, score in simil}
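
The scores returned by `MatrixSimilarity` are cosine similarities between the query and each document in LSI space. They are not probabilities, which is why the two values for a query do not have to sum to 1. A minimal pure-Python sketch of the measure itself; `cosine_similarity` is a hypothetical helper for illustration, not the code gensim runs:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```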

In [277]:
# Some test messages
query_inputs = [
    'yes captain',  # prequels (Clones acknowledgement (II, III))
    'you where the chosen one',  # prequels Obi-Wan to Anakin (III)
    'no i am your father',  # OT (Darth Vader to Luke (V))
    'this is the fastest ship of the galaxy',  # OT Han Solo about the Millennium Falcon
    'its your imagination kid, come on lets have sme optimism here',  # OT Han Solo to Luke (VI)
    'of course i know him, its me',  # OT (Ben Kenobi to Luke (IV))
    'this is a system we cannot afford to loose',  # prequels (Jedi council meeting (III))
    'this will make a fine addition to my collection',  # prequels (General Grievous getting a Jedi lightsaber (III))
    'an elegant weapon for a more civilized age',  # OT (Ben Kenobi giving Luke his father's lightsaber (IV))
    'red five standing by',  #  OT (Luke as X-wing pilot in rebel attack against the Death Star (IV))
    'with a million more well on the way',  # prequels (Kaminoan to Obi-Wan (II))
    'i will rearrange the senate in a new galactic empire'  # prequels (Palpatine electing himself as Emperor (III))
]

In [279]:
for query in query_inputs:
    out = get_result(query)
    print(query)
    for key, value in out.items():
        print(f'{key}: {value}')
    print('-'*25)

yes captain
PREQUELS: 0.9981762170791626
ORIGINAL TRILOGY: 0.36091670393943787
-------------------------
you where the chosen one
PREQUELS: 0.8601295351982117
ORIGINAL TRILOGY: 0.74738609790802
-------------------------
no i am your father
ORIGINAL TRILOGY: 0.9811432361602783
PREQUELS: 0.11409109830856323
-------------------------
this is the fastest ship of the galaxy
ORIGINAL TRILOGY: 0.8635291457176208
PREQUELS: 0.7429161071777344
-------------------------
its your imagination kid, come on lets have sme optimism here
ORIGINAL TRILOGY: 0.9800994396209717
PREQUELS: 0.48702582716941833
-------------------------
of course i know him, its me
ORIGINAL TRILOGY: 0.8212767839431763
PREQUELS: 0.7931703925132751
-------------------------
this is a system we cannot afford to loose
PREQUELS: 0.9442289471626282
ORIGINAL TRILOGY: 0.6007170081138611
-------------------------
this will make a fine addition to my collection
PREQUELS: 0.8419355154037476
ORIGINAL TRILOGY: 0.76996248960495
-------------

For these simple quotes, the search engine seems fairly accurate at figuring out which trilogy a message belongs to, but it can still miss or misclassify many inputs. LSA generally seems to work better on much larger datasets, and it has many other applications as well.
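
Several of the results above are close calls (for example roughly 0.82 against 0.79), so one option is to report a label only when the gap between the two scores is large enough. A minimal sketch, where `pick_label` is a hypothetical helper and the `margin` of 0.05 is an arbitrary assumption, not a tuned value. The scores in the usage lines are rounded from the output above:

```python
def pick_label(result, margin=0.05):
    """Return the best-scoring label, or None when the top two scores are within `margin`."""
    ranked = sorted(result.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score - second_score < margin:
        return None
    return best_label


# 'no i am your father': clear gap, so a label is returned
print(pick_label({'ORIGINAL TRILOGY': 0.9811, 'PREQUELS': 0.1141}))
# 'of course i know him, its me': gap below the margin, so None is returned
print(pick_label({'ORIGINAL TRILOGY': 0.8213, 'PREQUELS': 0.7932}))
```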

## References

Based on this code recipe: https://www.projectpro.io/recipes/explain-similarity-queries-gensim