# Natural Language Processing (NLP) Process For Seinfeld Transcripts

This notebook outlines the process of cleaning, tokenizing, and vectorizing the text transcripts of Seinfeld Season 5 episodes. Source of transcripts: https://www.seinfeldscripts.com/seinfeld-scripts.html


In [1]:
#Import the NLTK library, tokenizer, and methods
import nltk
nltk.download('punkt')
from nltk import word_tokenize
from nltk import sent_tokenize

import pandas as pd
import numpy as np
from sklearn.manifold import TSNE

#Import visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D

import re

[nltk_data] Downloading package punkt to /Users/Alex/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Read The Files

In [2]:
seinfeld_directory = 'Seinfeld_Episodes/Season_5/'

seinfeld_season_5_episodes = ['S05_E01_The_Mango.txt', 'S05_E02_The_Puffy_Shirt.txt',
                              'S05_E03_The_Glasses.txt', 'S05_E04_The_Sniffing_Accountant.txt',
                              'S05_E05_The_Bris.txt', 'S05_E06_The_Lip_Reader.txt',
                              'S05_E07_The_Non_Fat_Yogurt.txt', 'S05_E08_The_Barber.txt',
                              'S05_E09_The_Masseuse.txt', 'S05_E10_The_Cigar_Store_Indian.txt',
                              'S05_E11_The_Conversion.txt', 'S05_E12_The_Stall.txt',
                              'S05_E13_The_Dinner_Party.txt', 'S05_E14_The_Marine_Biologist.txt',
                              'S05_E15_The_Pie.txt', 'S05_E16_The_Stand-In.txt',
                              'S05_E17_The_Wife.txt', 'S05_E18_The_Raincoats_Part_1.txt',
                              'S05_E19_The_Raincoats_Part_2.txt', 'S05_E20_The_Fire.txt',
                              'S05_E21_The_Hamptons.txt', 'S05_E22_The Opposite.txt']


In [3]:
with open(seinfeld_directory + seinfeld_season_5_episodes[1], 'r') as file:
    raw_text_episode_2 = file.read().replace('\n', ' ')
    #.replace('[','(').replace(']',')')
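
The read step above can be wrapped in a small helper so that every episode is loaded the same way. This is a minimal sketch rather than the notebook's own code: `read_episode` is a hypothetical name, and UTF-8 is an assumption about how the transcript files were saved.

```python
from pathlib import Path

def read_episode(directory, filename, encoding="utf-8"):
    """Return one transcript as a single string with newlines replaced by spaces."""
    # encoding="utf-8" is an assumption; adjust it if the files were saved differently
    text = Path(directory, filename).read_text(encoding=encoding)
    return text.replace("\n", " ")

# Hypothetical usage with the names defined above:
# raw_text_episode_2 = read_episode(seinfeld_directory, seinfeld_season_5_episodes[1])
```

Passing the encoding explicitly avoids depending on the platform default, which may differ between machines.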

In [4]:
raw_text_episode_2[:1050]

"[Setting: Jerry's apartment] (Jerry and George are waiting for Kramer, so he can help them move George's stuff back into his parent's house) GEORGE: I can't believe this! JERRY: Oh, it won't be for that long. GEORGE: How can I do this?! How can I move back in with those people? Please, tell me! They're insane! You know that. JERRY: Hey, my parents are just as crazy as your parents. GEORGE: How can you compare you parents to my parents?! JERRY: My father has never thrown anything out. Ever! GEORGE: My father wears his sneakers in the pool! Sneakers! JERRY: My mother has never set foot in a natural body of water. GEORGE: (Showing Jerry up) Listen carefully. My mother has never laughed. Ever. Not a giggle, not a chuckle, not a tee-hee.. never went 'Ha!' JERRY: A smirk? GEORGE: Maybe!.. And I'm moving back in there! JERRY: I told you I'd lend you the money for the rent. GEORGE: No, no, no, no. Borrowing money from a friend is like having sex. It just completely changes the relationship. (

## Cleaning The Text Data

### Step 1: Manual

In [6]:
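def clean_text(raw_text):
    #Remove stage directions and scene settings, i.e. anything inside (...) or [...]
    text_without_directions = re.sub(r"[\(\[].*?[\)\]]", "", raw_text)
    cleaned_text = []
    for word in text_without_directions.split(" "):
        #Skip speaker labels such as "JERRY:" (this drops any token containing a colon)
        if ":" not in word:
            #Strip punctuation, then lowercase the word
            for symbol in ".,?!'*":
                word = word.replace(symbol, '')
            word = word.lower()
            #Skip elements left empty after cleaning
            if word:
                cleaned_text.append(word)

    return cleaned_text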

In [7]:
clean_text_episode_2 = clean_text(raw_text_episode_2)
' '.join(clean_text_episode_2[:200])

'i cant believe this oh it wont be for that long how can i do this how can i move back in with those people please tell me theyre insane you know that hey my parents are just as crazy as your parents how can you compare you parents to my parents my father has never thrown anything out ever my father wears his sneakers in the pool sneakers my mother has never set foot in a natural body of water listen carefully my mother has never laughed ever not a giggle not a chuckle not a tee-hee never went ha a smirk maybe and im moving back in there i told you id lend you the money for the rent no no no no borrowing money from a friend is like having sex it just completely changes the relationship alright im ready you know i still dont understand - why do you want to move back in with your parents i dont want to im outta money i got 714 dollars left in the bank well move in here whats that why doesnt he just move in here yeah yeah im gonna move in with him he doesnt even'
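
The two filtering rules in `clean_text` can be seen in isolation on a short line taken from the transcript above. This is a sketch of the same regular expression and colon check, not a replacement for the function; the variable names are illustrative.

```python
import re

# Same pattern as clean_text: a "(" or "[" followed by the shortest run of
# characters up to the next ")" or "]"
BRACKETED = re.compile(r"[\(\[].*?[\)\]]")

sample = ("[Setting: Jerry's apartment] GEORGE: (Showing Jerry up) "
          "Listen carefully. My mother has never laughed.")

without_directions = BRACKETED.sub("", sample)

# Dropping every token that contains ":" removes speaker labels such as "GEORGE:"
words = [w for w in without_directions.split(" ") if w and ":" not in w]
print(words)
```

Two side effects are worth keeping in mind: the colon rule also drops any spoken token that happens to contain a colon (a clock time such as `7:30`, for example), and hyphens are not in the punctuation list, which is why `tee-hee` and a stray `-` survive in the output above.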

In [8]:
for i in seinfeld_season_5_episodes:
    
    with open(seinfeld_directory + i, 'r') as file:
        raw_text_episode = file.read().replace('\n', ' ')
    
    print(i + ' ' + str(type(raw_text_episode)) + '\n')
    
    clean_text_episode = clean_text(raw_text_episode)
    print(' '.join(clean_text_episode[:200]) + '\n')

S05_E01_The_Mango.txt <class 'str'>

a female orgasm is kinda like the bat cave a very few people know where it is and if youre lucky enough to see it you probably dont know how you got there and you cant find you way back after you left you know there are two types of female the real and the fake and ill tell you right now as a man we dont know we do not know because to man sex is like a car accident and determining the female orgasm is like being asked what did you see after the car went out of control i heard a lot of screeching sounds i remember i was facing the wrong way at one point and in the end my body was thrown clear so whats her name karin is she nice great so you like her i think so you dont know i cant tell anymore well do you feel anything feel whats that all right let me ask you when she comes over youre cleaning up a lot yeah youre just straightening up or youre cleaning cleaning you do the tub yeah on your knees ajax scrubbing the whole deal yeah okay i think

S05_E02

## Tokenize The Sentence

In [9]:
#Uses NLTK's word_tokenize function to tokenize the cleaned script

def tokenize(cleaned_text):
    joined_sentence = ' '.join(cleaned_text)
    tokenized_sentence = word_tokenize(joined_sentence)
    
    return tokenized_sentence

In [10]:
tokenized_text_episode_2 = tokenize(clean_text_episode_2)
' '.join(tokenized_text_episode_2[:200])

'i cant believe this oh it wont be for that long how can i do this how can i move back in with those people please tell me theyre insane you know that hey my parents are just as crazy as your parents how can you compare you parents to my parents my father has never thrown anything out ever my father wears his sneakers in the pool sneakers my mother has never set foot in a natural body of water listen carefully my mother has never laughed ever not a giggle not a chuckle not a tee-hee never went ha a smirk maybe and im moving back in there i told you id lend you the money for the rent no no no no borrowing money from a friend is like having sex it just completely changes the relationship alright im ready you know i still dont understand - why do you want to move back in with your parents i dont want to im outta money i got 714 dollars left in the bank well move in here whats that why doesnt he just move in here yeah yeah im gon na move in with him he doesnt'
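
Comparing this output with the cleaned text shows that tokenizing changes at least one word: `gonna` in the cleaned text appears as `gon na` after `word_tokenize`. The following standard-library sketch locates such differences; `token_changes` is a hypothetical helper, and the two short lists are copied from the outputs above rather than recomputed.

```python
from difflib import SequenceMatcher

def token_changes(before, after):
    """Return (old_tokens, new_tokens) pairs wherever the two token lists differ."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    return [(before[i1:i2], after[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

cleaned = ["yeah", "im", "gonna", "move", "in"]
tokenized = ["yeah", "im", "gon", "na", "move", "in"]
print(token_changes(cleaned, tokenized))
```

Because the split happens inside the tokenizer, `gon` and `na` are counted as separate words in the vectorization steps that follow.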

# Vectorization

## Method 1: Count Vectorization

In [11]:
#(Note: This function is adapted from a guide by 'Learn.Co' titled 'Word Vectorization - Lab')

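def count_vectorize(episode, vocab=None):
    #Use the supplied vocabulary if given; otherwise build it from the episode
    if vocab:
        unique_words = vocab
    else:
        unique_words = list(set(episode))

    #Start every word in the vocabulary at a count of 0
    episode_dict = {i:0 for i in unique_words}

    for word in episode:
        #Skip words outside a supplied vocabulary instead of raising a KeyError
        if word in episode_dict:
            episode_dict[word] += 1

    return episode_dict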

vectorized_episode_2 = count_vectorize(tokenized_text_episode_2)
print(vectorized_episode_2)

{'really': 4, 'exquisite': 2, 'smirk': 1, 'hell': 1, 'soft': 2, 'shes': 6, 'sitting': 1, 'talking': 5, 'mention': 1, 'tee-hee': 1, 'cant': 7, 'getting': 2, 'delicate': 1, 'nice': 1, 'compassionate': 1, 'ha': 1, 'plenty': 1, 'frank': 1, 'maybe': 3, 'true': 2, 'cares': 1, 'today': 5, 'help': 2, 'oven': 1, 'very': 4, 'new': 6, 'well': 13, 'them': 11, 'stocking': 1, 'pay': 1, 'acupuncturists': 1, 'tv': 4, 'hair': 1, 'to': 58, 'fact': 1, 'i': 90, 'more': 6, 'excuse': 2, 'tip-top': 1, 'terrific': 1, 'ridiculous': 2, 'for': 10, 'puffy': 11, 'check': 1, 'neither': 1, 'life': 1, 'with': 15, 'want': 11, 'one': 8, 'now': 7, 'kind': 2, 'not': 11, 'got': 13, 'mentioned': 2, 'mailman': 1, 'giddy-up': 1, 'bop': 1, 'hope': 2, 'young': 1, 'count': 1, 'nodding': 1, 'over': 2, 'model': 2, 'dollar': 3, 'volunteer': 1, 'point': 1, 'session': 1, 'puffed': 2, 'uhh': 1, 'this': 25, 'match': 1, 'ta': 4, 'bet': 1, 'men': 1, 'lend': 1, 'lot': 2, 'moving': 1, 'was': 14, 'up': 7, 'kinda': 2, 'fashions': 1, 'still'
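
Passing a shared `vocab` gives every episode the same keys, which is what makes count vectors comparable across documents. Here is a self-contained sketch with toy token lists; `count_with_vocab` is a hypothetical stand-in built on `collections.Counter`, and ignoring out-of-vocabulary words is a choice made for the example.

```python
from collections import Counter

def count_with_vocab(tokens, vocab):
    """Count tokens over a fixed vocabulary; words outside it are ignored."""
    counts = Counter(tokens)
    # Counter returns 0 for missing words, so absent vocabulary entries become 0
    return {word: counts[word] for word in vocab}

doc_a = ["puffy", "shirt", "puffy"]
doc_b = ["shirt", "pirate"]

# Sorted union of both documents so the key order is reproducible
vocab = sorted(set(doc_a) | set(doc_b))

vec_a = count_with_vocab(doc_a, vocab)
vec_b = count_with_vocab(doc_b, vocab)
print(vec_a)
print(vec_b)
```

Both dictionaries now have the keys `pirate`, `puffy`, and `shirt`, with zeros where a word does not occur.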

## Method 2: Vectorization with TF-IDF

In [12]:
#Takes in a dictionary representing a document where the keys are unique words
#and the values are the count of those words

#Returns a new dictionary with the term frequency as the value for each word:
#(number of times word appears in a document)/(total number of words in a document)
#NOTE: The denominator is the document's total word count (the sum of all counts),
#not its number of unique words. The input dictionary is left unchanged.

def term_frequency(BoW_dict):
    total_word_count = sum(BoW_dict.values())

    return {word: count / total_word_count for word, count in BoW_dict.items()}

episode_2_term_frequency = term_frequency(vectorized_episode_2)

In [13]:
from collections import Counter

for i in Counter(episode_2_term_frequency).most_common(10): 
    print(i[0]," :",i[1]," ") 

i  : 0.0407055630936228  
you  : 0.03934871099050204  
the  : 0.031659882406151064  
to  : 0.026232473993668022  
a  : 0.024423337856173677  
it  : 0.016282225237449117  
what  : 0.014473089099954772  
in  : 0.013116236996834011  
that  : 0.012663952962460425  
this  : 0.011307100859339666  
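
The counts and frequencies printed above can be cross-checked by hand. The word `i` has a count of 90 and a term frequency of about 0.0407, which implies a denominator of 2211: the total number of tokens in the episode, not the number of unique words. A small sketch of that arithmetic follows (2211 is inferred from the printed values, not read from the notebook).

```python
# Values copied from the outputs above
count_i, tf_i = 90, 0.0407055630936228
count_to, tf_to = 58, 0.026232473993668022

# If tf = count / total, then total = count / tf
implied_total = round(count_i / tf_i)
print(implied_total)

# The same denominator should reproduce the frequency printed for 'to'
print(abs(count_to / implied_total - tf_to) < 1e-12)
```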


In [14]:
#Takes in a list of dictionaries where the elements are dictionaries representing each
#document and where the keys are unique words and the values are the count of those words

#Returns the Inverse Document frequency for each term in the Corpus:
#IDF(term) = log_e(Total Number of Documents in the Corpus/
#                  Number of Documents with 'Term' in It)
#Note: This function returns a dictionary where each key is a Corpus' unique term
#and the corresponding value is the IDF of that term

#(Note: This function is adapted from a guide by 'Learn.Co' titled 'Word Vectorization - Lab')

def inverse_document_frequency(list_of_dicts):

    # Creates a set of all unique words in the corpus
    vocab_set = set()
    # Iterate through the list of dicts and add each word (key) to vocab_set
    for d in list_of_dicts:
        vocab_set.update(d.keys())

    # Dictionary that will map each word in the corpus to its IDF
    full_vocab_dict = {}

    # Loop through each word in the vocabulary
    for word in vocab_set:
        # Count the documents whose dictionary contains the word
        docs = sum(1 for d in list_of_dicts if word in d)

        # docs is the denominator of the IDF equation; it is at least 1
        # because every word in vocab_set came from some document
        full_vocab_dict[word] = np.log(len(list_of_dicts) / docs)

    return full_vocab_dict
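
Two boundary cases make the IDF formula concrete for this 22-episode corpus: a word that appears in every episode gets ln(22/22) = 0, and a word that appears in only one episode gets ln(22), roughly 3.09. This is a minimal standard-library sketch; `idf_value` is a hypothetical helper, not the notebook's function.

```python
import math

def idf_value(total_docs, docs_with_term):
    """Natural-log inverse document frequency, as defined in the comments above."""
    return math.log(total_docs / docs_with_term)

print(idf_value(22, 22))            # word found in every episode
print(round(idf_value(22, 1), 2))   # word found in a single episode
```

One consequence is that any word present in all 22 episodes ends up with a TF-IDF of exactly 0, however often it is spoken.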

In [15]:
#Takes in a list of dictionaries where the elements are dictionaries representing each
#document and where the keys are unique words and the values are the count of those words

#Returns a list of dictionaries where the elements are dictionaries representing each
#document and where each dictionary's values represent a term's TF-IDF values for
#that document

#(Note: This function is adapted from a guide by 'Learn.Co' titled 'Word Vectorization - Lab')

def tf_idf(list_of_dicts):
    # Compute the IDF of every word in the corpus vocabulary
    idf = inverse_document_frequency(list_of_dicts)

    # Create tf-idf list of dictionaries, containing one dictionary per document
    tf_idf_list_of_dicts = []

    # Now, compute tf and then use this to compute and set tf-idf values for each document
    for doc in list_of_dicts:

        doc_tf = term_frequency(doc)
        #(number of times word appears in a document)/(total number of words in a document)

        # Build a fresh dictionary for each document; sharing one dictionary across
        # the loop would make every list element point to the same merged object
        doc_tf_idf = {}
        #Iterates through each key in one document's TF dictionary
        for word in doc_tf:
            doc_tf_idf[word] = doc_tf[word] * idf[word]
        tf_idf_list_of_dicts.append(doc_tf_idf)

    return tf_idf_list_of_dicts
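
A two-document toy corpus is small enough to check the TF-IDF values by hand. The sketch below is a compact, standard-library restatement of the same formulas, not the notebook's implementation; `toy_tf_idf` and the example counts are made up for illustration.

```python
import math

def toy_tf_idf(count_dicts):
    """TF-IDF for a list of {word: count} dictionaries, one new dict per document."""
    n_docs = len(count_dicts)
    vocab = set().union(*count_dicts)
    # IDF: natural log of (documents in corpus / documents containing the word)
    idf = {w: math.log(n_docs / sum(1 for d in count_dicts if w in d)) for w in vocab}

    results = []
    for counts in count_dicts:
        total = sum(counts.values())  # total tokens in this document
        results.append({w: (c / total) * idf[w] for w, c in counts.items()})
    return results

docs = [{"puffy": 2, "shirt": 2}, {"shirt": 1, "pirate": 3}]
scores = toy_tf_idf(docs)
print(scores[0])
print(scores[1])
```

`shirt` occurs in both documents, so its score is 0 in each; `puffy` scores 0.5 × ln 2 and `pirate` scores 0.75 × ln 2. Each document gets its own result dictionary, and the input counts are left untouched.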

In [16]:
def main(filenames):
    # Iterate through list of filenames and read each in
    count_vectorized_all_documents = []
    for file in filenames:
        # Read the episode as one string, as in the cells above. clean_text expects a
        # string; passing the list returned by f.readlines() raises
        # "TypeError: expected string or bytes-like object" inside re.sub
        with open(seinfeld_directory + file) as f:
            raw_data = f.read().replace('\n', ' ')

        print(file)
        # Clean and tokenize raw text
        cleaned = clean_text(raw_data)
        tokenized = tokenize(cleaned)

        # Get count vectorized representation and store in count_vectorized_all_documents
        count_vectorized_document = count_vectorize(tokenized)
        count_vectorized_all_documents.append(count_vectorized_document)

    # Now that we have a list of BoW representations of each episode, create a tf-idf representation of everything
    tf_idf_all_docs = tf_idf(count_vectorized_all_documents)

    return tf_idf_all_docs

tf_idf_all_docs = main(seinfeld_season_5_episodes)
print(list(tf_idf_all_docs[0])[:10])
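
Each dictionary returned by `tf_idf` only holds the words that occur in its own episode, so the dictionaries have different keys. Tools that expect a numeric matrix, such as the `TSNE` class imported at the top, need one row per episode and one column per vocabulary word. Below is a sketch of that conversion using plain lists; `to_dense_rows` is hypothetical, the toy input stands in for `tf_idf_all_docs`, and filling absent words with 0.0 is an assumption.

```python
def to_dense_rows(list_of_dicts):
    """Return (vocab, rows) with one row per document and one column per word."""
    # Sorted union of all keys fixes the column order
    vocab = sorted(set().union(*list_of_dicts))
    # Words missing from a document are filled with 0.0 (an assumption)
    rows = [[d.get(word, 0.0) for word in vocab] for d in list_of_dicts]
    return vocab, rows

toy_scores = [{"puffy": 0.35, "shirt": 0.0}, {"shirt": 0.0, "pirate": 0.52}]
vocab, rows = to_dense_rows(toy_scores)
print(vocab)
print(rows)
```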