Step 6 - Extract the most important content from caption text

The most important words in a game's dialogue give a rough sense of what the game is about. This notebook extracts such words from the caption text using NLTK.

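Prerequisites: nltk (the import cell below also loads scikit-learn and, through %pylab, NumPy and matplotlib)
 If nltk.download fails: manually download the required datasets (punkt, wordnet, stopwords, averaged_perceptron_tagger) to: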
 
 C:\nltk_data (Windows) 
 
 /usr/local/share/nltk_data (Mac)
 
 /usr/share/nltk_data (Unix)
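
To see which of the locations above is present on a machine, a minimal sketch using only the standard library follows. The helper name `existing_nltk_dirs` is hypothetical; the candidate list repeats the three paths above and adds the `NLTK_DATA` environment variable, which NLTK also consults.

```python
import os

# The three locations listed above. NLTK also searches other places
# (for example a per-user ~/nltk_data folder), so this is not its full search path.
CANDIDATE_DIRS = [
    r"C:\nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/share/nltk_data",
]

def existing_nltk_dirs(candidates=CANDIDATE_DIRS, env=os.environ):
    """Return the candidate data directories that exist on this machine."""
    # NLTK_DATA may hold several paths separated by os.pathsep
    from_env = [p for p in env.get("NLTK_DATA", "").split(os.pathsep) if p]
    return [d for d in from_env + list(candidates) if os.path.isdir(d)]

print(existing_nltk_dirs())
```

An empty list suggests the datasets should be downloaded into one of the directories above, or that `NLTK_DATA` should point at wherever they were placed.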
 

Input:

1. caption text

Output:

1. A JSON file mapping each word to its tf-idf score (tf-idf.json), plus a list of the ten highest-scoring words (example.json)

In [1]:
%pylab
import os, glob
import json
import math
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk import FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib
[nltk_data] Downloading package punkt to /usr/local/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [110]:
"""
Make sure these variables are set correctly.
caption_path: directory that contains the caption files (searched recursively for .txt files)
output_file: path of the JSON file the tf-idf scores are written to
"""

caption_path = './visualization/backend/datasource/The Last Of Us/output'
output_file = './visualization/backend/datasource/The Last Of Us/output/tf-idf.json'

In [111]:
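def tokenize(text):
    """Split caption text into lemmatized, alphabetic, non-stop-word tokens."""
    # NLTK's English stop words plus conversational filler common in game dialogue
    stop_words = stopwords.words('english')
    stop_words += ['yeah', 'okay', 'yes', 'hey', 'huh', "let", 
                   "something", "everything", "nothing", "anything", 
                   "thing", "guy", "alright", "look", "get", "way", 
                   "well", "good", "please", "thank","seen", "dun", "done", "said",
                  "tell", "thanks"]
    stop_words = set(stop_words)
    lemmatizer = nltk.WordNetLemmatizer()
    
    # Expand informal contractions into their component words
    mapping = [ ('gotta', 'have got to'), 
               ('gonna', 'going to')]
    for k, v in mapping:
        text = text.replace(k, v)
    
    # Split on single spaces only; word_tokenize(text) is the NLTK alternative
    word_tokens = text.split(' ')
    
    words = [lemmatizer.lemmatize(word) for word in word_tokens]
    # Keep purely alphabetic tokens (anything with attached punctuation is dropped)
    words = [word for word in words if word.isalpha()]
    # The stop-word check is case-insensitive; returned tokens keep their case
    words = [word for word in words if word.lower() not in stop_words]
    words = [word for word in words if len(word) > 2]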
    return words
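
One behaviour of `tokenize` is worth checking against your own captions: because the text is split on single spaces and then filtered with `str.isalpha()`, a word with attached punctuation or a newline is dropped rather than cleaned. The sketch below uses only the standard library to show the effect on a made-up sentence, together with one possible alternative, a regular expression that pulls out alphabetic runs. The helper names and the sample sentence are illustrative and not part of the original notebook.

```python
import re

sample = "Stay here, Ellie.\nWe gotta find a car"  # made-up caption text

def split_then_filter(text):
    # Mirrors the split(' ') and isalpha() steps used in tokenize()
    return [w for w in text.split(' ') if w.isalpha()]

def regex_words(text):
    # Alternative: extract runs of ASCII letters, ignoring punctuation and newlines
    return re.findall(r"[A-Za-z]+", text)

print(split_then_filter(sample))  # ['Stay', 'gotta', 'find', 'a', 'car']
print(regex_words(sample))        # ['Stay', 'here', 'Ellie', 'We', 'gotta', 'find', 'a', 'car']
```

The regular expression has its own trade-off: it splits contractions such as "don't" into two fragments, so which approach suits a given caption file is a judgment call.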

In [112]:
caption_files = sorted(glob.glob(os.path.join(caption_path, '**/*.txt'), recursive=True))  
print('Reading caption files...')
corpus = {}
for index, file in enumerate(caption_files):
    print(f'File {index}: {file}')
    with open(file, 'r') as txt:
        corpus[index] = txt.read()
        print(corpus[index], '\n')


Reading caption files...
File 0: ./visualization/backend/datasource/The Last Of Us/output/The_Last_Of_Us/screenshots/caption.txt



In [113]:
# calculate the frequency of each word
text = ' '.join(corpus.values())
tokens = tokenize(text)
total_num = len(tokens)
fd = FreqDist(tokens)
topn_words = dict(fd.most_common())
print(f'{len(tokens)} tokens in total.', f'{len(topn_words.keys())} unique words.', topn_words, sep='\n')

4300 tokens in total.
1339 unique words.
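
`FreqDist` is a subclass of `collections.Counter`, so the counting step can be previewed without NLTK. A small sketch on a hand-written token list (the tokens are illustrative, not taken from the captions):

```python
from collections import Counter

tokens = ["gun", "car", "gun", "Tommy", "car", "gun"]  # illustrative tokens

counts = Counter(tokens)
# most_common() with no argument returns every (word, count) pair,
# ordered from most to least frequent
word_freq = dict(counts.most_common())

print(len(tokens), "tokens in total.")
print(len(word_freq), "unique words.")
print(word_freq)  # {'gun': 3, 'car': 2, 'Tommy': 1}
```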


In [114]:
# find nouns

# pos_tag is documented to take a list of tokens, so pass a list rather than a dict view
tags = nltk.pos_tag(list(topn_words.keys()))
nouns = [tag[0] for tag in tags if tag[1] in ['NNP', 'NN', 'NNS', 'NNPS']]
print(nouns)
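
The filter keeps the four Penn Treebank noun tags, all of which begin with `NN`. A sketch on hand-written `(word, tag)` pairs, in the shape `nltk.pos_tag` returns (the pairs are illustrative, not real tagger output, and `keep_nouns` is a hypothetical helper):

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

# Hand-written pairs for illustration only
tagged = [("Tommy", "NNP"), ("stay", "VB"), ("gun", "NN"), ("cars", "NNS"), ("big", "JJ")]

def keep_nouns(pairs):
    """Return the words whose tag is one of the noun tags."""
    return [word for word, tag in pairs if tag in NOUN_TAGS]

print(keep_nouns(tagged))  # ['Tommy', 'gun', 'cars']
```

Tagging a list of unique words out of their original sentences gives the tagger less context than tagging full sentences would, so the resulting noun list is best treated as approximate.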



In [115]:
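# calculate idf over fixed-size chunks of tokens, treating each chunk as a document
chunk_size = 200
chunks = [tokens[x:x+chunk_size] for x in range(0, len(tokens), chunk_size)]

idf = {}
tfidf = {}
# for term in nouns:
for term, freq in topn_words.items():
    # idf = log10(number of chunks / number of chunks containing the term)
    idf[term] = math.log(len(chunks) / sum(1 for doc in chunks if term in doc), 10)
    # tf is the raw count of the term across the whole corpus
    tfidf[term] = freq * idf[term]

tfidf_sorted = sorted(tfidf.items(), key=lambda x: x[1], reverse=True)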

with open(output_file, 'w+') as output:
    json.dump(tfidf, output, indent=4)
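
The score computed in this cell is the term's raw count over the whole corpus multiplied by log10(N / df), where N is the number of 200-token chunks and df is the number of chunks that contain the term. A self-contained sketch of the same calculation on a toy token list follows; the function name `chunked_tfidf`, the chunk size of 3, and the tokens are chosen for illustration.

```python
import math
from collections import Counter

def chunked_tfidf(tokens, chunk_size):
    """Score each term as corpus count * log10(num_chunks / chunks containing it)."""
    # Each chunk plays the role of a document; sets make membership tests cheap
    chunks = [set(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]
    scores = {}
    for term, freq in Counter(tokens).items():
        doc_freq = sum(1 for chunk in chunks if term in chunk)
        scores[term] = freq * math.log10(len(chunks) / doc_freq)
    return scores

toy = ["gun", "car", "gun", "car", "stay", "car"]  # illustrative tokens
scores = chunked_tfidf(toy, chunk_size=3)
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(score, 3))
```

Here "car" is the most frequent token but appears in both chunks, so its idf is log10(2/2) = 0 and it scores 0, while "gun" appears twice in a single chunk and ranks first. The same effect pushes words spread evenly through the captions to the bottom of the ranking.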

In [116]:
answer_list = [word[0] for word in tfidf_sorted[:10]]
print(answer_list)

['Tommy', 'bill', 'Sam', 'Robert', 'gun', 'Henry', 'fuck', 'stay', 'talk', 'car']


In [117]:
with open(os.path.join(os.path.dirname(output_file), 'example.json'), 'w+') as output:
    json.dump(answer_list, output, indent=4)

In [None]:
"""

CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store.
UH interjection, errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when

"""
words = ['max', 'know', 'like', 'get', 'Chloe', 'Max', 'David', "nathan"]

nltk.pos_tag(words)
