My first attempt at preprocessing produced screenplays that were still far too long to feed to a neural network.  Here I'll cut the screenplays down as aggressively as possible, while hopefully still retaining the relevant info.

# 0. Import Data

In [None]:
import pandas as pd
import numpy as np

# raw string, so each backslash is written once
root_path = r'C:\Users\bened\DataScience\ANLP\AT2\36118_NLP_Spring\CSVs'

df_aus = pd.read_csv(f'{root_path}\\df_aus.csv', index_col=0)
df_aus.head()

In [None]:
df_aus.columns

In [None]:
roxbury = df_aus['screenplay'][0][:1000]
roxbury

In [None]:
print(roxbury)

In [None]:
print(df_aus['screenplay'][0][1000:2000]) 

# 1. Text Analysis

## Sentence Tokenizer

Before taking any additional steps, we'll train a Punkt sentence tokenizer on the whole corpus, for later use.

In [None]:
! pip install nltk

In [None]:
# build corpus

corpus = ("\n"*10).join(df_aus['screenplay'].values)
corpus

In [None]:
print(corpus[:1000])
print("-"*50)
print(corpus[-1000:])

In [None]:
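from nltk.tokenize.punkt import PunktSentenceTokenizer

# passing train_text trains the Punkt model on the corpus at construction time
screenplay_tokenizer = PunktSentenceTokenizer(train_text=corpus)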

In [None]:
import pickle

with open('sentence_tokenizer.pkl', 'wb') as f:
  pickle.dump(screenplay_tokenizer, f)

Let's take a look at the first screenplay in the data, 'A Night at the Roxbury', to get a sense of how to approach this.

In [None]:
roxbury_raw = df_aus.at[0, 'screenplay']
roxbury_raw[:1000]

- What's the first sequence here that might actually be relevant?  I would say perhaps the line "night falls and partytime begins."
- Formatting:  The first several lines are basically metadata, recognizable by consecutive runs of \n\t.
- The string "FADE IN" or "EXT." is roughly where the screenplay proper begins.

Let's look at a random sampling to determine a pattern.

In [None]:
import random

screenplay_texts = list(df_aus['screenplay'].values)

screenplay_sample = random.sample(screenplay_texts, 10)

# print first 1000 chars of each screenplay

for s in screenplay_sample:
  print(s[:1000])
  print("-"*50)

- For actual screenplays, the action seems to begin with "EXT" or "INT".  Some of the data here are scripts, which don't follow this format.  So ideally, we truncate everything before EXT|INT, unless there's no match for either, in which case we truncate nothing.

## Truncate Opening Metadata

In [None]:
import re

# first occurrence of a scene-heading keyword
pat = re.compile(r'EXT|INT')

def truncate_metadata(screenplay):
  match = pat.search(screenplay)
  if match:
    # skip the keyword and the character after it (usually the "." in "EXT.")
    cutoff = match.end() + 1
    return screenplay[cutoff:]
  # no scene heading found, so return the whole screenplay
  return screenplay

# beta test
roxbury_truncated = truncate_metadata(roxbury_raw)
print(roxbury_truncated[:1000])
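
One caveat with the bare `EXT|INT` pattern is that it also matches inside longer all-caps words in the metadata (for example the "INT" in "PRINTER"). A stricter variant is sketched below. It assumes that scene headings start a line with the keyword, which I haven't verified across the whole dataset, and the name `truncate_metadata_strict` and the sample text are made up for illustration.

```python
import re

# Assumption: a scene heading starts a line (after optional spaces/tabs)
# with INT or EXT as a whole word, e.g. "EXT. SUNSET STRIP - NIGHT".
heading_pat = re.compile(r'^[ \t]*(?:INT|EXT)\b', re.MULTILINE)

def truncate_metadata_strict(screenplay):
  """Drop everything before the first line-initial INT/EXT heading."""
  match = heading_pat.search(screenplay)
  if match is None:
    # no heading found (e.g. a script without scene headings): keep everything
    return screenplay
  # same cutoff as truncate_metadata: skip the keyword and the next character
  return screenplay[match.end() + 1:]

# hypothetical sample, not taken from the data
sample = "WRITTEN BY\n\tA. PRINTER\n\nFADE IN:\n\nEXT. SUNSET STRIP - NIGHT\n\nNight falls."
print(repr(truncate_metadata_strict(sample)))
```

Note that the word boundary means a spelled-out "INTERIOR" or "EXTERIOR" would not match, so this trades some recall for fewer false cutoffs.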

We'll also keep updating the stopword bank as we go along.

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

In [None]:
print(stop_words)

In [None]:
# beta test for a script: look the screenplay up by title
nightmare = df_aus.loc[df_aus['title'] == 'The Nightmare Before Christmas', 'screenplay'].iloc[0]
nightmare_truncated = truncate_metadata(nightmare)
print(nightmare_truncated[:1000])

In [None]:
# seems to be working okay, so let's apply to all screenplays
truncated_screenplays = df_aus['screenplay'].apply(truncate_metadata)
truncated_screenplays

Let's now see if there are any patterns at the end of the screenplays.

In [None]:
screenplay_sample = random.sample(screenplay_texts, 10)

# print last 1000 chars of each screenplay

for s in screenplay_sample:
  print(s[-1000:])
  print("-"*50)

We wouldn't really be removing enough words here for it to be worthwhile, so the endings are left as they are.

## Remove All-Caps

In general, it seems that words in all capital letters are either character names or photographic direction.  Neither is really relevant to us.  We will first tokenize into words, and then remove the words that are all-caps.

In [None]:
# first I want to see how nltk.word_tokenize will behave
from nltk.tokenize import word_tokenize

roxbury_tokens = word_tokenize(roxbury_truncated, preserve_line=True)
print(roxbury_tokens[:1000])

In [None]:
# filter out all-caps tokens (character names, camera direction)
def filter_allcaps(tokens):
  return [t for t in tokens if not t.isupper()]

roxbury_lower = filter_allcaps(roxbury_tokens)
print(roxbury_lower[:1000])
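
It's worth checking what `str.isupper()` actually treats as all-caps, because it is also true for single capital letters such as the pronoun "I". The sketch below uses a made-up token list, and the `min_len` parameter is a hypothetical addition of mine, not something the filter above has.

```python
# hypothetical tokens, chosen to show the edge cases
tokens = ["STEVE", "I", "walk", "A", "EXT.", "O.S.", "Roxbury", "1998", "(", "OK"]

def filter_allcaps_keep_short(tokens, min_len=2):
  """Drop all-caps tokens, but keep ones shorter than min_len (e.g. "I")."""
  return [t for t in tokens if not (t.isupper() and len(t) >= min_len)]

# tokens that str.isupper() flags: digits and brackets are not flagged,
# but "I" and "A" are
print([t for t in tokens if t.isupper()])
print(filter_allcaps_keep_short(tokens))
```

Whether keeping one-letter tokens matters depends on the later steps, since "I" and "a" are stopwords anyway.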

In [None]:
# now filter out punctuation
import string

# single punctuation characters plus multi-character tokens from word_tokenize
puncts = set(string.punctuation)
puncts.update(['``', '--', '...', "''"])

def filter_puncts(tokens):
  filtered_tokens = []
  for t in tokens:
    if t not in puncts:
      filtered_tokens.append(t)
  return filtered_tokens

roxbury_unpunctuated = filter_puncts(roxbury_lower)
print(roxbury_unpunctuated[:1000])

In [None]:
# filter out stopwords
def remove_stopwords(tokens):
  tokens_nonstop = []
  for t in tokens:
    if t not in stop_words:
      tokens_nonstop.append(t)
  return tokens_nonstop

roxbury_nonstop = remove_stopwords(roxbury_unpunctuated)
print(roxbury_nonstop[:1000])

More stopwords will be removed after converting to lowercase (the NLTK list is all lowercase, so capitalized tokens such as "The" slip through), but I'm not sure I want to do that yet because I want to preserve sentence boundaries.
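
One possible middle ground is to compare each token in lowercase while leaving the token itself untouched. This is only a sketch: `demo_stop_words` is a tiny stand-in for the NLTK set so that the cell runs on its own, and `remove_stopwords_caseless` is a hypothetical name.

```python
# Assumption: small stand-in for stop_words, not the real NLTK list
demo_stop_words = {"the", "and", "he", "is", "a", "to"}

def remove_stopwords_caseless(tokens, stop_words):
  """Drop stopwords regardless of case; surviving tokens keep their case."""
  return [t for t in tokens if t.lower() not in stop_words]

demo_tokens = ["The", "club", "is", "packed", "and", "He", "nods", "to", "Doug"]
print(remove_stopwords_caseless(demo_tokens, demo_stop_words))
```

The capitalization of the remaining tokens is preserved, so anything downstream that relies on case still sees it.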

First, though, we can remove all numbers.

In [None]:
def remove_numbers(tokens):
  # isalpha() keeps purely alphabetic tokens, so besides numbers this also
  # drops contractions ("n't") and hyphenated words
  return [t for t in tokens if t.isalpha()]

roxbury_alpha = remove_numbers(roxbury_nonstop)
print(roxbury_alpha[:1000])
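
Because `isalpha()` rejects any token with a non-letter in it, it removes more than just numbers. If that turns out to be too aggressive, a narrower filter could drop only tokens that contain a digit. The sketch below compares the two on a made-up token list; `remove_numeric_tokens` is a hypothetical alternative, not what the pipeline above uses.

```python
def remove_numeric_tokens(tokens):
  """Drop only tokens containing a digit; keep contractions and hyphenated words."""
  return [t for t in tokens if not any(ch.isdigit() for ch in t)]

# hypothetical tokens in the style of word_tokenize output
demo_tokens = ["Doug", "n't", "well-dressed", "1998", "2nd", "club", "'s"]

print([t for t in demo_tokens if t.isalpha()])   # what remove_numbers keeps
print(remove_numeric_tokens(demo_tokens))        # what the narrower filter keeps
```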

# Now we'll rejoin the text so we can sentence tokenize

In [None]:
from nltk.tokenize.punkt import PunktSentenceTokenizer

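# untrained tokenizer with default parameters; sentences_from_tokens yields
# sentences lazily, so materialize them in a list before slicing
sentence_tokenizer = PunktSentenceTokenizer()
roxbury_sentences = list(sentence_tokenizer.sentences_from_tokens(tokens=roxbury_alpha))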
print(roxbury_sentences[:100])
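
Since the goal is to shrink the screenplays, it may help to see how many tokens each filtering step removes. Below is a rough sketch of a helper that chains token filters and records the count after each one. `apply_filters` and the two stand-in filters are hypothetical; in this notebook the list would instead hold `filter_allcaps`, `filter_puncts`, `remove_stopwords` and `remove_numbers`.

```python
def apply_filters(tokens, filters):
  """Run each filter in order and record how many tokens survive each step."""
  counts = [("start", len(tokens))]
  for f in filters:
    tokens = f(tokens)
    counts.append((f.__name__, len(tokens)))
  return tokens, counts

# stand-in filters so the sketch runs on its own
def drop_allcaps(tokens):
  return [t for t in tokens if not t.isupper()]

def drop_non_alpha(tokens):
  return [t for t in tokens if t.isalpha()]

# hypothetical pre-tokenized line
demo = "STEVE walks into the club at 9 , nodding .".split()
kept, counts = apply_filters(demo, [drop_allcaps, drop_non_alpha])
for name, n in counts:
  print(f"{name}: {n} tokens")
```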