Step 6 - Extract the most important content from caption text

The most distinctive words in the dialogue can convey the gist of what a game is about. This notebook uses nltk to extract them: it removes stop words from the caption text, groups the remaining words into three-word phrases, and ranks the phrases by TF-IDF.

Prerequisites: nltk (the import cell also loads scikit-learn and, through %pylab, numpy and matplotlib)

Input:

1. Caption text files (.txt), found recursively under the caption directory

Output:

1. A JSON file (tf-idf.json) that maps each three-word phrase to its TF-IDF score, plus examples.json with the 20 highest-scoring phrases

In [2]:
%pylab
import os, glob
import json
import math
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk import FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')  # needed by stopwords.words('english') in tokenize()

Using matplotlib backend: TkAgg
Populating the interactive namespace from numpy and matplotlib


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Zooerius\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Zooerius\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Zooerius\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
"""
Make sure these variables are set correctly.
caption_path: directory that contains the caption files (searched recursively)
output_file: path of the JSON file the TF-IDF scores are written to
"""

caption_path = './data source/Life Is Strange 1/output'
output_file = './data source/Life Is Strange 1/output/tf-idf.json'

In [4]:
caption_files = sorted(glob.glob(os.path.join(caption_path, '**/*.txt'), recursive=True))
print('Reading caption files...')
corpus = {}
for index, file in enumerate(caption_files):
    print(f'File {index}: {file}')
    with open(file, 'r') as txt:
        corpus[index] = txt.read()
        print(corpus[index], '\n')


Reading caption files...
File 0: ./data source/Life Is Strange 1/output\Episode_1_Chrysalis\screenshots\caption.txt
you happening and where is here wait there's the lighthouse I hope please let me make it there holy shit wow that was so cerejo mostly called film little pieces of time but he could be talking about photography as he likely was okay in class everything's cool I am okay from light to shadow from color to chiaroscuro now can you give me an example of a photographer who perfectly captured the human condition in he didn't fall asleep and it sure didn't feel like a dream Bueller weird Diane Arbus there you go Victoria why Arvest because of her look at this crap how can I show this to mr. Jefferson I can hear the class laughing at me now images of hopeless faces you feel like totally haunted by the eyes of those sad mothers and children she saw humanity is torture a little camera bag is battered but still kick you should keep that to yourself if anybody else looked at this what

In [7]:
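def tokenize(text):
    """Split caption text on spaces and drop English stop words and common fillers."""
    stop_words = stopwords.words('english')
    stop_words += ['yeah', 'okay', 'yes', 'hey', 'huh']
    stop_words = set(stop_words)
    lemmatizer = nltk.WordNetLemmatizer()  # only used by the disabled lemmatization step below

    # Expand informal contractions so that parts such as 'have' and 'to' are filtered as stop words.
    mapping = [('gotta', 'have got to'),
               ('gonna', 'going to')]
    for k, v in mapping:
        text = text.replace(k, v)

    # Plain split on single spaces; word_tokenize(text) is the alternative.
    word_tokens = text.split(' ')

    # Optional steps, currently disabled:
    # words = [lemmatizer.lemmatize(word) for word in word_tokens]
    # words = [word for word in words if word.isalpha()]
    words = [word for word in word_tokens if word.lower() not in stop_words]
    # words = [word for word in words if len(word) > 2]
    return words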
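
The effect of the two main steps in tokenize (splitting and stop-word removal) can be seen on a short string. The following is a minimal sketch, not the notebook's function: `simple_tokenize` and the tiny `STOP_WORDS` set are hypothetical stand-ins so the example runs without NLTK data. It also shows that `text.split(' ')` keeps empty tokens and embedded newlines, which `text.split()` would avoid.

```python
# Tiny stand-in stop list (an assumption for illustration; the notebook uses
# NLTK's English list plus a few filler words).
STOP_WORDS = {'i', 'to', 'have', 'the', 'okay', 'yeah'}

def simple_tokenize(text, stop_words=STOP_WORDS):
    """Hypothetical helper: whitespace split followed by stop-word removal."""
    # str.split() with no argument collapses runs of spaces and newlines,
    # whereas text.split(' ') yields empty-string tokens for double spaces.
    return [w for w in text.split() if w.lower() not in stop_words]

sample = 'okay I  have to get to the lighthouse\nyeah'
print(sample.split(' '))        # contains '' and 'lighthouse\nyeah'
print(simple_tokenize(sample))  # only the content words remain
```

Whether to switch the notebook itself to `split()` is a judgment call, since it would change the token counts reported below.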

In [18]:
# count how often each phrase occurs, where a phrase is a non-overlapping window of `window_size` tokens
window_size = 3
text = ' '.join(corpus.values())
tokens = tokenize(text)
total_num = len(tokens)
print(f'length of tokens: {total_num}')
combined_tokens = [' '.join(tokens[idx:idx+window_size]) for idx in range(0, len(tokens), window_size)]
total_num = len(combined_tokens)
print(f'length of combined tokens: {total_num}')
fd = FreqDist(combined_tokens)
topn_words = dict(fd.most_common())
print(f'{len(tokens)} tokens in total.', f'{len(topn_words.keys())} unique words.', topn_words, sep='\n')

length of tokens: 27253
length of combined tokens: 9085
27253 tokens in total.
8979 unique words.
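
Because the windows in the cell above do not overlap (the stride equals `window_size`), a phrase is only counted when it happens to line up with a window boundary. The sketch below contrasts that grouping with overlapping n-grams on a made-up token list; `chunked_windows` and `sliding_windows` are hypothetical helper names, and the sliding variant is an alternative for comparison, not what the notebook does.

```python
from collections import Counter

def chunked_windows(tokens, size):
    """Non-overlapping windows, as in the cell above (stride == size)."""
    return [' '.join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def sliding_windows(tokens, size):
    """Overlapping n-grams (stride == 1); an alternative, not the notebook's method."""
    return [' '.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

# Made-up tokens for illustration.
tokens = 'see vortex club party tonight vortex club party rocks'.split()

chunked = Counter(chunked_windows(tokens, 3))
sliding = Counter(sliding_windows(tokens, 3))
print(chunked['vortex club party'])  # the phrase never aligns with a window
print(sliding['vortex club party'])  # both occurrences are counted
```

Sliding windows produce roughly three times as many phrases for `window_size = 3`, so the counts and the chunking used for IDF would need to be adjusted accordingly.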


In [19]:
# calculate tf-idf: split the phrase list into chunks of `chunk_size` phrases and treat each chunk as a document
chunk_size = 100
chunks = [combined_tokens[x:x+chunk_size] for x in range(0, len(combined_tokens), chunk_size)]

idf={}
tfidf={}
for term in combined_tokens:
    freq = float(topn_words[term])
    idf[term] = math.log(float(len(chunks))/sum([1 for doc in chunks if term in doc]), 10)
    tfidf[term] = freq * idf[term]
    
tfidf_sorted = sorted(tfidf.items(), key=lambda x: x[1], reverse=True)

print(tfidf_sorted[:20])
print(tfidf_sorted[-20:])

with open(output_file, 'w+') as output:
    json.dump(tfidf, output, indent=4)

[('vortex club party', 6.300356939925374), ('everyday heroes contest', 4.445760412804293), ('oh hi max', 4.445760412804293), ("I've never seen", 4.445760412804293), ('going back time', 4.445760412804293), ("can't hide forever", 4.445760412804293), ('note oh thanks', 3.9180827846421864), ('another vision town', 3.9180827846421864), ('going get wiped', 3.9180827846421864), ('die geez wrote', 3.9180827846421864), ('miss grant get', 3.9180827846421864), ('going go max', 3.9180827846421864), ("we'll talk real", 3.9180827846421864), ('Rachel dark room', 3.9180827846421864), ('dun dun dun', 3.9180827846421864), ('seeing old women', 3.9180827846421864), ('street like nice', 3.9180827846421864), ('right like real', 3.9180827846421864), ('road trip week', 3.9180827846421864), ('Nathan texted says', 3.9180827846421864)]
[("what's happened Arcadia", 1.9590413923210932), ('Bay ever since', 1.9590413923210932), ('first saved know', 1.9590413923210932), ("I've selfish think", 1.9590413923210932), ('a
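
The scores above follow the formula used in the cell: raw phrase count times log10(number of chunks / number of chunks containing the phrase). As a sanity check, the sketch below recomputes a few values with a hypothetical helper `tf_idf`. The chunk count of 91 comes from the 9085 phrases reported earlier; the count and chunk combinations passed in are inferred from the printed scores, not read from the data.

```python
import math

def tf_idf(term_count, doc_count, num_docs):
    """Raw term count times log10(num_docs / doc_count), as in the cell above."""
    return term_count * math.log10(num_docs / doc_count)

# 9085 phrases split into chunks of 100 give 91 chunks (the last one is partial).
num_chunks = math.ceil(9085 / 100)

print(tf_idf(1, 1, num_chunks))  # one occurrence in one chunk, about 1.959
print(tf_idf(2, 1, num_chunks))  # two occurrences in the same chunk, about 3.918
print(tf_idf(5, 5, num_chunks))  # five occurrences in five chunks, about 6.300
```

Under this scheme a phrase repeated inside a single chunk outranks one spread over several chunks, which is why the chunk size directly shapes the ranking.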

In [20]:
answer_list = tfidf_sorted[:20]

In [30]:
with open(os.path.join(os.path.dirname(output_file), 'examples.json'), 'w+') as output:
    json.dump(dict(answer_list), output, indent=4)

In [None]:
"""

CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store
UH interjection, errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when

"""
words = ['max', 'know', 'like', 'get', 'Chloe', 'Max', 'David', "nathan"]

nltk.pos_tag(words)
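
`nltk.pos_tag` returns a list of (word, tag) pairs, so one possible use of the tags listed above is to keep only certain word classes, for example nouns. The sketch below assumes that goal; `keep_tags` is a hypothetical helper, and the tagged pairs are written by hand for illustration rather than taken from the tagger.

```python
# Hand-written (word, tag) pairs for illustration only; these are NOT actual
# output of nltk.pos_tag.
tagged = [('Chloe', 'NNP'), ('know', 'VB'), ('lighthouse', 'NN'), ('like', 'IN')]

def keep_tags(pairs, prefixes=('NN',)):
    """Hypothetical helper: keep words whose tag starts with one of the prefixes."""
    # str.startswith accepts a tuple, so 'NN' matches NN, NNS, NNP and NNPS.
    return [word for word, tag in pairs if tag.startswith(prefixes)]

print(keep_tags(tagged))                          # nouns only
print(keep_tags(tagged, prefixes=('NN', 'VB')))   # nouns and verbs
```

Note that the tagger relies on surrounding words, so tags for isolated words such as the list above may differ from the tags the same words receive inside a sentence.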
