## Preprocessor

1. Open `corpus.txt` and read it line by line.
2. Remove everything enclosed in parentheses or square brackets (stage directions).
3. If a line is a character's line, extract the character's name and the utterance, then add them to the dictionary:
    1. if the character's name already exists in the dictionary, append the utterance to that character's list of utterances;
    2. if the character's name is not in the dictionary, add a new entry for the character.
    
4. Build an additional dictionary of BOWs (bags of words), where each character maps to a list of per-utterance BOWs:
    1. create a new dictionary;
    2. for each entry in the utterances dictionary, take each utterance, tokenize it, lemmatize it, and remove punctuation and stop words;
    3. convert the preprocessed utterance into a BOW and add it to the character's list;
    4. store the list of BOWs in the new dictionary under the character's name.
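
The first three steps can be tried out on a few in-memory lines before running them on the whole corpus. The following is only a minimal sketch: `collect_utterances` and the sample lines are hypothetical names and data invented for illustration, and the extra `strip()` calls are an assumption that is not part of the notebook code.

```python
import re

# Same pattern as the notebook: an upper-case name, a colon, then the utterance.
CHARACTER_LINE = re.compile(r"^([A-Z]+): (.{1,250})$")


def collect_utterances(lines):
    """Group utterances by character name, dropping stage directions."""
    utterances = {}
    for line in lines:
        # Remove (parenthesised) and [bracketed] stage directions.
        cleaned = re.sub(r"\(.+?\)", "", line)
        cleaned = re.sub(r"\[.+?\]", "", cleaned)
        match = CHARACTER_LINE.search(cleaned.strip())
        if match:
            name, utterance = match.groups()
            # setdefault creates the list the first time a character speaks.
            utterances.setdefault(name, []).append(utterance.strip())
    return utterances


# Invented sample lines, for illustration only.
sample = [
    "JERRY: (to George) Hello there.",
    "[Setting: a coffee shop]",
    "GEORGE: Hi.",
    "JERRY: How are you?",
]
result = collect_utterances(sample)
print(result)
```

Lines that contain nothing but a stage direction become empty after cleaning and are skipped, because they no longer match the character-line pattern.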

In [1]:
import re
import pickle
from seinfeld_functions import lemmatize_with_pos, preprocess_sentence, bow_sentence

In [2]:
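utterances_dictionary = {}       # character -> list of raw utterances
bows_dictionary = {}             # character -> list of BOWs, one per utterance
total_bows_per_character = {}    # character -> set of all words the character uses
# A character's line: upper-case name, colon, space, then 1-250 characters of speech.
characters_line_regex = r"^([A-Z]+): (.{1,250})$"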

In [3]:
with open("corpus.txt", "r", encoding="utf8") as corpus:
    for line in corpus:
        # Strip stage directions: first (parentheses), then [square brackets].
        # The second substitution must run on cleaned_line, not on the raw line,
        # otherwise the first substitution is discarded.
        cleaned_line = re.sub(r"\(.+?\)", "", line)
        cleaned_line = re.sub(r"\[.+?\]", "", cleaned_line)
        match_line = re.search(characters_line_regex, cleaned_line)
        if match_line:
            name, utterance = match_line.groups()
            # setdefault creates the list on a character's first utterance.
            utterances_dictionary.setdefault(name, []).append(utterance)

In [4]:
for character, utterances in utterances_dictionary.items():
    bow_list = []            # one BOW per utterance
    characters_bow = set()   # every distinct word the character uses
    for utterance in utterances:
        preprocessed_utterance = preprocess_sentence(utterance)
        utterance_bow = bow_sentence(preprocessed_utterance)
        characters_bow.update(utterance_bow)
        bow_list.append(utterance_bow)
    bows_dictionary[character] = bow_list
    total_bows_per_character[character] = characters_bow
    
# print(bows_dictionary["NEWMAN"])
# print(total_bows_per_character["NEWMAN"])
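
The helpers `preprocess_sentence` and `bow_sentence` come from `seinfeld_functions`, which is not shown here. To make the data flow concrete, here is a rough stand-in that uses only the standard library. Everything in it is an assumption: the `_sketch` function names, the tiny stop-word list, and the choice of a `Counter` as the BOW are hypothetical, and lemmatization is left out entirely, so the real helpers may well behave differently.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "i", "you", "to", "of", "and"}


def preprocess_sentence_sketch(sentence):
    """Lower-case, tokenize on letters/apostrophes, and drop stop words.

    Punctuation disappears because only letter runs are kept.
    Lemmatization is deliberately omitted to keep the sketch stdlib-only.
    """
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def bow_sentence_sketch(tokens):
    """Count how often each token occurs (a bag of words)."""
    return Counter(tokens)


# Invented utterance, for illustration only.
utterance = "Hello, Newman. The mail is the mail!"
bow = bow_sentence_sketch(preprocess_sentence_sketch(utterance))

# Iterating over a Counter yields its keys, so a per-character vocabulary
# can be accumulated in a set, as the loop above does.
vocabulary = set(bow)
print(bow, vocabulary)
```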

In [5]:
# Context managers make sure each file is closed after writing.
for filename, data in [("seinfeld_utterances_dictionary.p", utterances_dictionary),
                       ("seinfeld_bows_dictionary.p", bows_dictionary),
                       ("seinfeld_bows_per_character.p", total_bows_per_character)]:
    with open(filename, "wb") as pickle_file:
        pickle.dump(data, pickle_file)
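
A later notebook can read the saved dictionaries back with `pickle.load`. The sketch below shows the round trip on a small invented dictionary written to a temporary directory; the file name `utterances_example.p` and the sample data are hypothetical and stand in for the real output files.

```python
import os
import pickle
import tempfile

# Invented stand-in for one of the dictionaries built above.
utterances = {"JERRY": ["Hello there."], "GEORGE": ["Hi."]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "utterances_example.p")
    # Write the dictionary in binary mode...
    with open(path, "wb") as pickle_file:
        pickle.dump(utterances, pickle_file)
    # ...and read it back, also in binary mode.
    with open(path, "rb") as pickle_file:
        restored = pickle.load(pickle_file)

print(restored == utterances)
```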